class: center, middle, inverse, title-slide

# Topic Models and EDA
## JHU Data Science
### www.jtleek.com/advdatasci

---
class: inverse, middle, center

# Topic Models are a

> type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

.footnote[https://en.wikipedia.org/wiki/Topic_model]

---
class: inverse
background-image: url(../imgs/topic/latent.png)
background-size: 80%
background-position: center

.footnote[https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf]

---
class: inverse
background-image: url(../imgs/topic/genetic_paper.png)
background-size: 80%
background-position: center

.footnote[https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf]

---
class: inverse

## Topic: A probability distribution over a fixed vocabulary

<div style='font-size:30pt'>
Conceptual algorithm:
</div>
<br>
<div style='font-size:30pt'>
1. Randomly choose a distribution over topics <br>
2. For each word in the document: <br>
a. Randomly choose a topic from the distribution in step 1 <br>
b. Randomly choose a word from that topic's distribution over the vocabulary
</div>

---
class: inverse

<img src="../imgs/topic/graph_model.png" style="width:80%">

.pull-left[
- β - topics (each a distribution over words)
- w - observed words
- θ - topic proportions in each document
- z - topic assignment of each word
]

.pull-right[
- α - prior on topic proportions
- D - \# of documents
- K - \# of topics
- η - prior on topics
]

---
class: inverse
background-image: url(../imgs/topic/real_inf.png)
background-size: 100%
background-position: center

---
class: inverse
background-image: url(../imgs/topic/real_inf_hand.png)
background-size: 100%
background-position: center

---
class: inverse, middle, center

# Topic models lab

<font color="red" style='font-size:40pt'>
https://goo.gl/0bcib4
</font>

---
class: inverse, middle, center

# Exploratory data analysis

---
class: inverse
background-image: url(../imgs/topic/book.png)
background-size: 40%
background-position: center

# A good book

.footnote[https://leanpub.com/exdata]

---
class: inverse

## Steps in an EDA

.huge[
.pull-left[
- Read in data
- Figure out what it is
- Pre-process it
- Look at dimensions
- Look at values (str)
]

.pull-right[
- Make tables
- Hunt for messed up values
- Hunt for NAs
- Plot it
- Don't fool yourself
]
]

---
class: inverse
background-image: url(../imgs/topic/vc.png)
background-size: 60%
background-position: center

# Example

---
class: inverse
background-image: url(../imgs/topic/elephant.png)
background-size: 80%
background-position: center

.footnote[https://en.wikipedia.org/wiki/Blind_men_and_an_elephant#/media/File:Blind_monks_examining_an_elephant.jpg]

---
class: inverse
background-image: url(../imgs/topic/nature.png)
background-size: 80%
background-position: center

---
class: inverse

## simplystats data - what is it?
```r
# Clone the blog repository into a temporary directory
tdir = file.path(tempdir(), "ss")
x = git2r::clone(url = "https://github.com/simplystats/simplystats.github.io",
                 local_path = tdir,progress = FALSE)
posts = file.path(tdir, "_posts")
library(tm)
suppressPackageStartupMessages({library(dplyr)})
# Read every file in _posts as one document of a corpus
ds = DirSource(posts)
simply = VCorpus(ds)
class(simply); length(simply)
```

```
[1] "VCorpus" "Corpus" 
```

```
[1] 965
```

```r
str(simply[[1]])
```

```
List of 2
 $ content: chr [1:14] "---" "title: Example post" "author: jeff" "layout: post" ...
 $ meta   :List of 7
  ..$ author       : chr(0) 
  ..$ datetimestamp: POSIXlt[1:1], format: "2017-09-20 21:18:03"
  ..$ description  : chr(0) 
  ..$ heading      : chr(0) 
  ..$ id           : chr "2011-09-01-examplepost.md"
  ..$ language     : chr "en"
  ..$ origin       : chr(0) 
  ..- attr(*, "class")= chr "TextDocumentMeta"
 - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
```

---
class: inverse

## simplystats data - preprocess

```r
library(tidytext)
# One row per word, keeping the document metadata alongside it
tidy_simply = simply %>% tidy %>% unnest_tokens(word,text) %>%
  select(author,datetimestamp,id,word)
dim(tidy_simply)
```

```
[1] 540717      4
```

```r
str(tidy_simply)
```

```
Classes 'tbl_df', 'tbl' and 'data.frame':	540717 obs. of  4 variables:
 $ author       : logi  NA NA NA NA NA NA ...
 $ datetimestamp: POSIXct, format: "2017-09-20 17:18:03" "2017-09-20 17:18:03" ...
 $ id           : chr  "2011-09-01-examplepost.md" "2011-09-01-examplepost.md" "2011-09-01-examplepost.md" "2011-09-01-examplepost.md" ...
 $ word         : chr  "title" "example" "post" "author" ...
``` --- class: inverse ## simplystats data - look at dimensions and values ```r tidy_simply %>% as.data.frame %>% head(2) ``` ``` author datetimestamp id word 1 NA 2017-09-20 17:18:03 2011-09-01-examplepost.md title 2 NA 2017-09-20 17:18:03 2011-09-01-examplepost.md example ``` ```r tidy_simply %>% as.data.frame %>% tail(2) ``` ``` author datetimestamp id word 540716 NA 2017-09-20 17:18:04 2017-05-04-debt-haircuts.md after 540717 NA 2017-09-20 17:18:04 2017-05-04-debt-haircuts.md 1990 ``` ```r tidy_simply %>% group_by(word) %>% count() ``` ``` # A tibble: 28,039 x 2 # Groups: word [28,039] word n <chr> <int> 1 __ 5 2 ___ 15 3 _____ 3 4 __abstract__ 1 5 __choose 1 6 __comparisons 1 7 __conclusions__ 1 8 __data 4 9 __give 1 10 __include 1 # ... with 28,029 more rows ``` ```r print(table(tidy_simply$word)) ``` ``` __ 5 ___ 15 _____ 3 __abstract__ 1 __choose 1 __comparisons 1 __conclusions__ 1 __data 4 __give 1 __include 1 __introduction__ 1 __level 8 __make 1 __methods__ 1 __moderate 1 __option 5 __post 1 __set 1 __share 2 __supplementary 1 __the 5 __this 1 __title__ 1 __use 1 __write 2 __your 1 _0 1 _2000 1 _8 1 _a 11 _a_new_low_for_tv_science_malware_fractals_in_bones_ampbull_videosiftcom 1 _a_structure_for_deoxyribose_nucleic_acid 1 _actually 2 _admit 1 _after 1 _after_ 2 _aggressive 1 _alma 1 _always_ 1 _american_crime_story 1 _an 13 _analyze_ 1 _and 3 _annals 4 _annie 1 _any_ 1 _are 1 _arrange_ 1 _art_ 1 _as 1 _ask 1 _association 1 _automatically_ 1 _average 1 _averages_ 1 _award 1 _b 7 _baltimore 2 _baron_may_of_oxford 1 _be_ 1 _benefits.pdf 1 _bigone 1 _biometrika_ 2 _biostatistics_ 4 _blank 1597 _box 1 _broad 2 _but 4 _can 2 _capital 1 _capstone 1 _china 1 _churning_ 1 _citation 1 _collect_ 1 _communicate_ 1 _completely_ 1 _controlled 2 _controls_ 1 _correction 1 _crazy 1 _d_ 13 _data 4 _data_ 1 _designed_ 1 _difficulty 15 _disciplinary 2 _disclaimer 3 _disclaimers 1 _docs 2 _does_ 1 _don't 2 _during 1 _e 18 _edit 2 _editor 4 _editor's 53 _editor’s 4 _encoding 
1 _environmental 1 _equivalence 1 _estimate_ 1 _everything_ 1 _evidence_ 2 _exactly 1 _exaggerated 1 _f 6 _failing 1 _filter_ 1 _final_ 1 _for 1 _foreach_ 2 _format 1 _function_ 1 _gave 1 _general 1 _get 1 _getting 2 _ggplot2 1 _good 2 _googlesheets_ 1 _gym 1 _have 1 _health 1 _here 1 _hsenc 2 _hsmi 2 _i 5 _i_ 3 _if 1 _images 61 _in 7 _incidentally 1 _income 2 _independent 1 _is 4 _is_ 1 _isn't 1 _j 3 _jasa 1 _jasa_ 1 _jden 3 _journal 2 _jrss 2 _just 1 _k 2 _know_ 1 _lars 1 _leek 1 _len 1 _let 1 _lm1 2 _load 5 _load_policy 2 _mayo_ 2 _me_ 1 _microsoft_corp 1 _mis_ 1 _missing 1 _mo 1 _moneyball_ 1 _most 1 _mutate_ 1 _nature_ 5 _neuroimage_ 1 _new 1 _nine_ 1 _nlp 1 _no 1 _noninferiority_vs 1 _not 3 _not_ 5 _note 2 _note_ 1 _number 1 _o 2 _observational 1 _occur 4 _one 1 _only 2 _opening 1 _opinion_ 1 _or 1 _other_ 2 _outcome 4 _over 1 _p 2 _p_ 1 _paperpile_ 3 _part_ 1 _past_ 1 _people 1 _per_capita 1 _personal 1 _php 4 _piketty’s 1 _plos 2 _policy 5 _political 1 _possible_ 1 _post 1 _pre 4 _predict.lm 2 _prediction 1 _present_ 1 _pretty 1 _private 1 _proceedings 1 _projection_ 1 _prometheus 1 _psychological 1 _public 1 _publish_ 1 _put 1 _quick 1 _quickly_ 2 _r 68 _r_ 2 _raise_ 2 _randall 1 _rawest_ 1 _realize 1 _really_ 1 _reasonable_ 1 _redefine 1 _replicable_ 1 _reproducibility 1 _research 1 _respond 1 _right 2 _ripley 1 _run_anywhere 1 _s 1 _science 2 _science_ 7 _scientific_ 1 _scotusblog 1 _security 2 _see_ 1 _select_ 1 _self 3 _share_ 2 _shitz_ 1 _show 1 _simply 1 _simplystats 6 _simplystats_ 1 _simpson 1 _slice_ 1 _small 1 _so 1 _someone_ 1 _source 1 _statistical 2 _statistical_ 1 _statistics 1 _statistics_ 1 _stewart 1 _super_ 1 _suppose 1 _t 2 _t_ 4 _tell_ 1 _that's 1 _the 14 _the_ 2 _there 1 _thinking_ 1 _this 11 _thkdllaf 1 _tl 2 _to 1 _to_ 1 _ton_ 1 _treat 1 _true 1 _try 1 _tv_series 1 _tvh5edd22c 1 _two_ 1 _type 4 _uncontrolled 1 _update 8 _use 1 _very 1 _very_ 2 _w_articles_wsite3_1_26 1 _way_ 1 _we 6 _we've 1 _weren't_ 1 _what 30 _when 2 _whether 1 
_whether_ 1 _which 1 _while 1 _why 2 _why_ 2 _will 1 _wilson 1 _with 3 _word 1 _wt 1 _x 4 _x_ 12 _xlen 1 _y 3 _y_ 8 _you 2 _you_ 5 _youden 1 _your_ 1 ˈdætə 1 ˈdɑːtə 1 ˈdeɪtə 1 0 476 0_1fcbc04421 1 0,1 10 0,10 2 0,10,1 1 0,22 1 0,5 2 0,6244826 1 0.0 1 0.000 1 0.000000299 1 0.000000358 1 0.000001 2 0.00001 1 0.000071 1 0.00007108 1 0.0001 3 0.000114 1 0.00013 1 0.000954 1 0.001 3 0.001049 1 0.00106785 1 0.0011 1 0.001244 1 0.0018 1 0.001819 1 0.001820 1 0.00182009 1 0.001860 1 0.001936 1 0.002 1 0.0036 1 0.004885 1 0.005 1 0.00933871 1 0.009478 1 0.0096 1 0.01 8 0.01165 1 0.0117 1 0.0140 4 0.0141 2 0.015 2 0.03 6 0.0377 1 0.04 3 0.0454 1 0.0489 1 0.05 28 0.06 1 0.0733 1 0.0869 1 0.0933 1 0.09341 1 0.09411 1 0.1 3 0.10 1 0.112284 1 0.1316 1 0.1324 1 0.14 1 0.15 1 0.16 2 0.18 1 0.19 1 0.2 1 0.215679 1 0.22 1 0.242 1 0.258 1 0.27 2 0.276 1 0.297049 1 0.29704909 1 0.298153 1 0.29815323 1 0.29841993 1 0.298420 1 0.3 2 0.30 2 0.302859 1 0.304346 1 0.304838 1 0.314 1 0.375 1 0.386 1 0.4 1 0.4546 1 0.454624 1 0.456883 1 0.4569 1 0.462371 1 0.4624 1 0.5 8 0.5,0.5,2,0.5 1 0.53 1 0.574mg 1 0.57j62.4254 2 0.6 1 0.648584 1 0.64858433 1 0.65 1 0.67 1 0.8 3 0.80 1 0.84 3 0.85 1 0.89 3 0.9 1 0.9.0 1 0.92 5 0.95 5 0.98 3 0.99 5 00 2888 000 3 000000 4 0001002 2 000313001300339969 1 00031305.2016.1154108 1 000551 1 001 4 00144feabdc0 3 0020124 2 0024532 2 0024914 1 002689 1 0026895 8 0030161 4 0036210 1 003665 1 003713 1 00401706.1972.10488878 1 005983 1 0062731025 1 006601 1 0066463 1 0085047 2 0089470 1 00pm 1 00rllkkregxnvtttdzzdn3tw24vgwarocrr75qipjazsaq5mapz8aqtdzl0wz12egzwbge3bt377d77ruhgwayiuw22wrh0kbbrwv6jgikkkcaagooubcbdrcattljjognn94osdhmhgikkkcaagoo0h6bdrcatx8tzqmaagoooiacchrlwacowpvhbhrqqaefffagbwedobyq3yqcciiggaikfevaakhy9wfuffbaaquuucahaqoghjddhaikkkcaagous8aaqfj1yw4uueabbrrqiaebdp8gn0pe3iqcwwj0799 2 01 658 013698 1 014 4 0143125702 1 016214501750332875 2 016214501753168082 1 016214503000071 1 016214504000000656 1 016214504000000683 1 01621459.1926.10502161 1 
01621459.2011.645777 2 01621459.2013.808157 1 018150 1 01t01 1 01t08 1 01t09 3 01t10 2 01t11 2 01t12 1 01t13 8 01t14 6 01t15 4 01t16 1 01t18 2 01t19 2 01t21 2 01x 1 02 606 02_areas 2 02_index.shtml 2 020842 1 027516 1 02gujc0hh2ct1egocyxqizrfu91c72ea 7 02t01 1 02t09 3 02t10 5 02t12 1 02t13 6 02t14 9 02t15 3 02t16 3 02t17 3 02t19 1 02t20 1 02t21 4 03 509 0314policyforumff.pdf 3 0321774671 1 034009 1 038 4 0387981403 1 0387987746 1 0393310728 1 0393929728 3 03a78pfwrfy18sd3t 2 03koller.html 1 03t00 1 03t01 3 03t07 1 03t08 1 03t09 4 03t10 4 03t11 2 03t12 3 03t13 4 03t14 7 03t15 6 03t16 3 03t17 5 03t18 1 03t20 1 03t21 1 04 365 04_become 1 04_eligibility.shtml 1 0470944889 1 0471727814 1 0486260348 1 04t02 1 04t09 2 04t10 4 04t11 1 04t12 2 04t13 5 04t14 12 04t15 5 04t16 4 04t17 3 04t18 4 04t19 2 04t21 1 04t22 1 05 451 05.12.03 2 050575 1 0525953736 1 052915 1 0534 4 0544703391 1 0553418815 1 056xathhgiaimb5cphriw9ystnnarqprlicdggfinjpifn5m3xwwepotl8ukfw 2 05jkyp 2 05pdf 1 05t07 1 05t09 2 05t10 4 05t11 3 05t12 3 05t13 4 05t14 5 05t15 4 05t16 7 05t17 2 05t18 2 05t20 2 06 458 0606441 3 06112015 1 0636920018483 1 0636920021421 1 0636920034919 1 066803 3 067443000x 1 068478 1 06stats.html 1 06t00 1 06t01 1 06t06 1 06t07 1 06t09 1 06t10 4 06t11 4 06t12 3 06t13 6 06t14 5 06t15 2 06t16 5 06t17 1 06t18 3 07 522 07program.html 1 07t00 1 07t02 1 07t03 2 07t05 1 07t08 1 07t10 5 07t11 4 07t12 2 07t13 4 07t14 5 07t15 6 07t16 4 07t17 3 07t18 1 07t20 1 08 522 0810.4672 2 088278 1 08genes.html 2 08t00 2 08t01 1 08t02 2 08t09 1 08t10 5 08t11 1 08t12 1 08t13 1 08t14 8 08t15 6 08t16 5 08t17 3 08t18 3 08t19 2 08t20 4 08t21 1 08t23 1 09 523 0902.2183v2 2 090506 1 0919132824ac2090de45f2b1135b0163 1 09t00 1 09t01 2 09t08 1 09t09 2 09t10 4 09t11 2 09t12 3 09t13 7 09t14 3 09t15 8 09t16 2 09t17 2 09t19 5 09t20 1 09t21 1 09t23 1 09xzr 2 0ahukewi3usl8 1 0ap1000000121055 1 0atd3gd8kgn45de9bz1ptywtca0m2v 1 0atd3gd8kgn45de9bz1ptywtca0m2vwhkckroue9klve 1 0b678utpufn80a2rkouc5lw51cvu 1 0cb0qfjaa 1 
0cgoqfjaa 1 0dcc 1 0fqy5c4yr3waxxxrr7eii 2 0i 1 0lm0h0gpo 2 0ntew2uttcj 2 0px 1 0sbijnxxva 2 0wdthhh 2 0x0 1 0y2ywntii 2 1 1660 1_ 1 1__ 1 1,000 15 1,000,000 2 1,000s 1 1,030,000 1 1,040 1 1,154 1 1,158 1 1,2 2 1,2,3 3 1,320,000 1 1,350.00 1 1,50 1 1,500 3 1,600 1 1,937.28 1 1.0 6 1.02.25 5 1.0300 1 1.030007 1 1.1 1 1.10298 2 1.10634 2 1.10937 3 1.11.22 5 1.11520 1 1.12413 2 1.12852 1 1.13.0 1 1.13187 1 1.13469 1 1.13812 1 1.13942 1 1.14131 1 1.14205 3 1.14586 1 1.14700 4 1.16 1 1.16232 2 1.16643 1 1.17001 1 1.17411 1 1.17412 4 1.17552 2 1.1k 1 1.2 1 1.25.01 7 1.28 1 1.3 1 1.31.38 6 1.32.45 10 1.4826 1 1.5 7 1.54 2 1.57 2 1.6 1 1.7 1 1.76 1 1.7e 1 1.855 1 1.88 1 1.8e 1 10 909 10,000 13 10,000,000 3 10,500 1 10.1002 2 10.1021 1 10.1056 12 10.1080 6 10.1089 1 10.1103 1 10.1111 5 10.1146 1 10.1198 8 10.1207 2 10.1371 19 10.14.53 3 10.15.04 2 10.16.32 6 10.16.54 5 10.3389 2 10.5 2 10.6 2 10.7 1 10.7910 1 100 123 100,000 7 1000 17 1000006298 1 10001 1 10005107 1 1000genomes.org 2 1001_3 1 1001093 2 10013120929 3 1001380691 1 1001736045 1 1002 2 1002106 4 10021164565 4 100217094 1 1003925 1 1004 2 1004w 1 1005 4 1005253374 1 1006 4 1006094136 1 1006252706 1 10068195751 5 1007 2 100721630 1 1007987592 1 1008 2 100893932596 1 1009 3 1009213726 1 100k 2 100m 1 100plus 5 100plus.com 3 100s 2 100vw 128 101 11 1010 2 1010.1092 1 1010.3003 1 10100207969753748 1 10101 1 1011 6 1011234490 1 1012 4 10124797490 4 1013 5 101313 1 1014500084 1 10147 1 10150407246723134 1 1016711311 1 1018192159 1 102 4 1020259799 1 10204192286 3 1023939249 1 1024 3 10241004305 3 10245243013 2 1024px 2 1024w 49 1024x1007 3 1024x1024 9 1024x182 1 1024x249 2 1024x279 2 1024x300 1 1024x334 2 1024x341 1 1024x463 3 1024x486 1 1024x507 1 1024x511 3 1024x548 6 1024x603 3 1024x608 3 1024x640 3 1024x690 1 1024x703 3 1024x704 1 1024x737 2 1024x768 10 1024x773 2 1024x779 2 1024x814 2 1024x818 3 1024x94 1 1025 2 10255632119 2 1025882709 1 1025883338 1 102614 1 1026527086 1 1026763525 1 1027 1 102713 1 10273897765 
2 1029 2 1029149123 1 1029549249 1 103 2 1030 1 1031 2 1033 2 1033221837 1 1034910757 1 1035w 1 1036 1 10361220686 2 1036656879 1 1036px 2 1036w 1 104 2 1040 2 10402321009 3 10410458080 5 1041971025 1 1042601604 1 10440058465 2 10440612965 3 10441403664 4 10447922152 2 1044857889 1 1046 2 1046200794 1 1046909850 1 1047454398 1 10478244664 2 1049607380 1 105 1 1050 2 1050461554 1 1050w 4 1051 1 1051486634 1 10516468054 2 10521062620 3 10524782074 3 1055 2 10555655037 4 10558246695 4 1057243708 1 10587 1 1059 2 106 1 10609667035 2 1062013 1 1062826373 1 1064967702 1 1065 2 1065232992 1 10654056 4 1065951162 1 10666543777 2 10686092687 5 1068686839 1 10689363695 2 107 2 10727900138 2 1073533480 1 1074 2 10751500518 2 1076164595 1 10764298034 11 10766696449 3 1076796098 1 10774878590 2 1078 2 108 2 10805255044 3 10806006878 2 10809464773 3 1080p 1 1081 2 1081699366 1 1084420068 1 1085 2 10852070603 3 1085478767 1 10887441867 2 1088925029 1 109 1 1090 2 1090598955 1 10915054266 2 1092 2 10952374827 2 1097 2 10978122797 2 10989030989 3 1099444309 1 1099478740 1 10a.htm 1 10e 1 10gb 1 10m 1 10s 1 10t00 1 10t01 3 10t03 1 10t05 1 10t09 2 10t10 5 10t11 2 10t12 2 10t13 3 10t14 9 10t15 3 10t16 2 10t18 2 10t19 2 10t20 3 10t21 1 10t22 1 10th 5 11 605 11.1 1 11.12.20 5 11.14.20 2 11.20.46 3 11.38 1 11.47.45 7 11.49.41 4 11.513 1 110 20 11006943 1 11014 1 11021068390 2 1102297797 1 11024349209 3 1102872845 1 1104037177 1 1104226164 1 1104324149 1 1104346425 1 1105 1 [ reached getOption("max.print") -- omitted 27039 entries ] ``` --- class: inverse ## simplystats data - hunt NAs/weird values ```r colMeans(is.na(tidy_simply)) ``` ``` author datetimestamp id word 1 0 0 0 ``` ```r most_freq = tidy_simply %>% group_by(word) %>% count() %>% arrange(desc(n)) head(most_freq) ``` ``` # A tibble: 6 x 2 # Groups: word [6] word n <chr> <int> 1 the 20936 2 a 14890 3 to 11626 4 of 11166 5 and 9052 6 in 7234 ``` --- class: inverse ## simplystats data - hunt NAs/weird values ```r tidy_simply = 
  tidy_simply %>% anti_join(stop_words)
```

```
Joining, by = "word"
```

```r
most_freq = tidy_simply %>% group_by(word) %>% count() %>% arrange(desc(n))
head(most_freq)
```

```
# A tibble: 6 x 2
# Groups:   word [6]
  word      n
  <chr> <int>
1 http   6963
2 data   5394
3 00     2888
4 post   2626
5 span   2408
6 href   2196
```

---
class: inverse

## Steps in an EDA

.huge[
.pull-left[
- Read in data
- Figure out what it is
- Pre-process it
- Look at dimensions
- Look at values (str)
]

.pull-right[
- Make tables
- Hunt for messed up values
- Hunt for NAs
- <font color="red">Plot it</font>
- <font color="red">Don't fool yourself</font>
]
]

---
class: inverse
background-image: url(../imgs/topic/quartet.png)
background-size: 70%
background-position: center

# Why plot

.footnote[http://en.wikipedia.org/wiki/Anscombe's_quartet]

---
class: inverse

## Characteristics of exploratory plots

.superduper[
- They are made quickly <br>
- A large number are made <br>
- The goal is for personal understanding <br>
- Axes/legends are generally cleaned up later <br>
- Color/size are primarily used for information
]

---
class: inverse

## EDA

.superduper[
- EDA is part statistics, part psychology <br>
- Unfortunately we (humans) are designed to find patterns even when there aren't any <br>
- Visual perception is biased by your humanness.
<br>
- The key goal in EDA is to not trick yourself
]

---
class: inverse
background-image: url(../imgs/topic/optical.png)
background-size: 70%
background-position: center

# Optical illusions teach us about plotting

.footnote[http://brainden.com/visual-illusions.htm]

---
class: inverse
background-image: url(../imgs/topic/optical_3d.png)
background-size: 60%
background-position: center

# Optical illusions teach us about plotting

.footnote[http://blog.revolutionanalytics.com/2012/12/create-optical-illusions-with-r.html]

---
class: inverse
background-image: url(../imgs/topic/ggplot.png)
background-size: 50%
background-position: center

# Plots can be thought of as test statistics

.footnote[http://vita.had.co.nz/papers/inference-infovis.pdf]

---
class: inverse
background-image: url(../imgs/topic/tasks.png)
background-size: 45%
background-position: center

# Background perceptual tasks

.footnote[http://www.jstor.org/stable/2288400]

---
class: inverse
background-image: url(../imgs/topic/pos_length.png)
background-size: 90%
background-position: center

# Position vs. length

.footnote[http://www.jstor.org/stable/2288400]

---
class: inverse
background-image: url(../imgs/topic/pos_length_results.png)
background-size: 80%
background-position: center

# Position vs. length - results

.footnote[http://www.jstor.org/stable/2288400]

---
class: inverse
background-image: url(../imgs/topic/pos_angle.png)
background-size: 75%
background-position: center

# Position vs. angle

.footnote[http://www.jstor.org/stable/2288400]

---
class: inverse
background-image: url(../imgs/topic/pos_angle_results.png)
background-size: 90%
background-position: center

# Position vs. angle - results

.footnote[http://www.jstor.org/stable/2288400]

---
class: inverse
background-image: url(../imgs/topic/slopes.png)
background-size: 50%
background-position: center

# The worst - maybe slopes?
.footnote[http://www.jstor.org/stable/2288400]

---
class: inverse
background-image: url(../imgs/topic/scale.png)
background-size: 80%
background-position: center

# Scale matters

.footnote[http://statweb.stanford.edu/~cgates/PERSI/papers/scatter82.pdf]

---
class: inverse
background-image: url(../imgs/topic/corr.png)
background-size: 50%
background-position: center

# People perceive correlations weirdly

.footnote[http://statweb.stanford.edu/~cgates/PERSI/papers/scatter82.pdf]

---
class: inverse
background-image: url(../imgs/topic/linear.png)
background-size: 80%
background-position: center

# Detecting even linear relationships

.footnote[http://statweb.stanford.edu/~cgates/PERSI/papers/scatter82.pdf]

---
class: inverse
background-image: url(../imgs/topic/sens_spec.png)
background-size: 80%
background-position: center

# People are bad at judging significance in plots

.footnote[https://peerj.com/articles/589/]

---
class: inverse

## Summary

.huge[
* Use common scales when possible
* When possible use position comparisons
* Angle comparisons are hard to interpret (no pie charts!)
* No 3-D bar charts
* Be careful not to "fool" yourself about significance (either way)
]

---
class: inverse

## RSkittleBrewer

<img src="../imgs/topic/skittle.png" style="height:70%">

```r
devtools::install_github('alyssafrazee/RSkittleBrewer')
```

```r
trop = RSkittleBrewer::RSkittleBrewer("tropical")
palette(trop)
par(pch=19)
```

---
class: inverse

## simplystats data - 1-D

```r
hist(most_freq$n,col = 2); hist(log2(most_freq$n + 1), col = 2)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-7-1.png)<!-- -->![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-7-2.png)<!-- -->

---
class: inverse

## simplystats data - 1-D

```r
boxplot(log2(most_freq$n + 1), col = 2)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---
class: inverse

## simplystats data - 1-D

```r
nn = dim(most_freq)[1]
plot(rep(1,nn), log2(most_freq$n + 1),col = 2);
plot(jitter(rep(1,nn)), log2(most_freq$n + 1), col = 2, xlim = c(0.5,1.5))
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-9-1.png)<!-- -->![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-9-2.png)<!-- -->

---
class: inverse

<img src="../imgs/topic/color_picker.png" style="width:50%">

```r
plot(jitter(rep(1,nn)),log2(most_freq$n+1), col="#00000010",xlim=c(0.5,1.5))
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---
class: inverse
background-image: url(../imgs/topic/wages.png)
background-size: 70%
background-position: center

# Example: wage data

.footnote[http://cran.r-project.org/web/packages/ISLR,<br>http://pavelpodolyak.blogspot.com/2012/06/wages-in-resource-based-economy.html]

---
class: inverse

## Colors for confounders

```r
library(ISLR)
suppressPackageStartupMessages({library(ggplot2)})
p = ggplot(aes(age,wage), data = Wage)
p + geom_point(); p + geom_point(aes(colour = jobclass))
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-11-1.png)<!-- 
-->![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-11-2.png)<!-- -->

---
class: inverse

## Smooth scatter for big data

```r
smoothScatter(Wage$age,Wage$wage)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---
class: inverse

## Smooth scatter for big data

```r
library(rafalib); splot(Wage$age,Wage$wage)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-13-1.png)<!-- -->

---
class: inverse

## Aside: graphics devices

.super[
* <font color="red">**pdf**</font>: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects/points
* <font color="red">**png**</font>: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well
* <font color="red">**jpeg**</font>: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
]

---
class: inverse

## Models as visual aid

```r
qq = qplot(age,wage,colour=education,data=Wage)
qq + geom_smooth(method='lm',formula=y~x)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-14-1.png)<!-- -->

---
class: inverse

## Models as visual aid

```r
library(modelr); mod1 = lm(wage ~ age*education, data=Wage)
Wage1 = Wage %>% add_predictions(mod1)
qq <- qplot(age,wage,colour=education,data=Wage1)
qq + geom_point(aes(age,pred),colour="black")
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-15-1.png)<!-- -->

---
class: inverse

## Residuals, colored

```r
Wage1 = Wage1 %>% mutate(resid=wage-pred)
qplot(age,resid,colour=education,data=Wage1)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-16-1.png)<!-- -->

---
class: inverse

## Factors for plotting groups

```r
suppressPackageStartupMessages({library(Hmisc)})
cutWage <- cut2(Wage$wage,g=3); table(cutWage,Wage$jobclass)
```

```
               
cutWage         1. Industrial 2. Information
  [ 20.1, 92.2)           629            371
  [ 92.2,118.9)           533            507
  [118.9,318.3]           382            578
```

---
class: inverse

## Factors for plotting groups

```r
p1 = qplot(cutWage,age, data=Wage,fill=cutWage, geom=c("boxplot")); p1
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-18-1.png)<!-- -->

---
class: inverse

## Overlay points

```r
p2 = qplot(cutWage,age, data=Wage,fill=cutWage, geom=c("boxplot","jitter"))
gridExtra::grid.arrange(p1,p2,ncol=2)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-19-1.png)<!-- -->

---
class: inverse

## Heatmaps

```r
library(pheatmap); head(USArrests, 3)
```

```
        Murder Assault UrbanPop Rape
Alabama   13.2     236       58 21.2
Alaska    10.0     263       48 44.5
Arizona    8.1     294       80 31.0
```

```r
args(pheatmap)
```

```
function (mat, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(100), 
    kmeans_k = NA, breaks = NA, border_color = "grey60", cellwidth = NA, 
    cellheight = NA, scale = "none", cluster_rows = TRUE, cluster_cols = TRUE, 
    clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean", 
    clustering_method = "complete", clustering_callback = identity2, 
    cutree_rows = NA, cutree_cols = NA, 
    treeheight_row = ifelse((class(cluster_rows) == "hclust") || cluster_rows, 50, 0), 
    treeheight_col = ifelse((class(cluster_cols) == "hclust") || cluster_cols, 50, 0), 
    legend = TRUE, legend_breaks = NA, legend_labels = NA, annotation_row = NA, 
    annotation_col = NA, annotation = NA, annotation_colors = NA, 
    annotation_legend = TRUE, annotation_names_row = TRUE, annotation_names_col = TRUE, 
    drop_levels = TRUE, show_rownames = T, show_colnames = T, main = NA, 
    fontsize = 10, fontsize_row = fontsize, fontsize_col = fontsize, 
    display_numbers = F, number_format = "%.2f", number_color = "grey30", 
    fontsize_number = 0.8 * fontsize, gaps_row = NULL, gaps_col = NULL, 
    labels_row = NULL, labels_col = NULL, filename = NA, width = NA, 
    height = NA, silent = FALSE, ...) 
NULL
```

---
class: inverse

## Heatmaps

```r
pheatmap(USArrests)
```

![](08-topic-models-and-eda-slides_files/figure-html/unnamed-chunk-21-1.png)<!-- -->

---
class: inverse
background-image: url(../imgs/topic/ggplot2_no.png)
background-size: 70%
background-position: center

.footnote[http://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/]

---
class: inverse
background-image: url(../imgs/topic/ggplot2_yes.png)
background-size: 80%
background-position: center

.footnote[https://hopstat.wordpress.com/2016/02/18/how-i-build-up-a-ggplot2-figure/]
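
---
class: inverse

## Topic models - simulating the conceptual algorithm

Going back to the conceptual algorithm from the topic model slides, here is a minimal base R sketch that simulates one document. Everything in it is made up for illustration: the toy vocabulary, the number of topics, the Dirichlet parameters, and the helper `rdirichlet1` are assumptions, not part of the lab or of any package.

```r
# Hypothetical toy setup: none of these names or values come from the slides
set.seed(1)
vocab = c("gene", "dna", "data", "model", "brain", "neuron")
K = 2          # number of topics
n_words = 20   # number of words in the simulated document

# Hypothetical helper: one Dirichlet draw, made by normalizing gamma draws
rdirichlet1 = function(alpha) {
  g = rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Topics (beta): each row is a probability distribution over the vocabulary
beta = t(replicate(K, rdirichlet1(rep(0.5, length(vocab)))))
colnames(beta) = vocab

# Step 1: randomly choose a distribution over topics for the document (theta)
theta = rdirichlet1(rep(1, K))

# Step 2: for each word, choose a topic using theta (z),
# then choose a word using that topic's row of beta (w)
z = sample(K, n_words, replace = TRUE, prob = theta)
w = vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))

# Which topic generated which words
table(topic = z, word = w)
```

Fitting a topic model runs this in reverse: only the words `w` are observed, and `beta`, `theta` and `z` have to be estimated from them.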