class: center, middle, inverse, title-slide

# Topic Models and EDA
## JHU Data Science
###

---
class: inverse, middle, center

# Topic Models are a

> type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/latent.png)
background-size: 80%
background-position: center

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/genetic_paper.png)
background-size: 80%
background-position: center

.footnote[]

---
class: inverse

## Topic: A probability distribution over a fixed vocabulary

<div style='font-size:30pt'>
Conceptual algorithm:
</div>
<br>
<div style='font-size:30pt'>
1. Randomly choose a distribution over topics <br>
2. For each word in the document: <br>
&nbsp;&nbsp;&nbsp;a. Randomly choose a topic using the distribution from step 1 <br>
&nbsp;&nbsp;&nbsp;b. Randomly choose a word using that topic
</div>

---
class: inverse

<img src="../imgs/topic/graph_model.png" style="width:80%">

.pull-left[
- β - topic distribution
- w - observed words
- θ - topic proportions in document
- z - topic assignment in document
]

.pull-right[
- α - prior on proportions
- D - \# of documents
- K - \# of topics
- η - prior on topic distributions
]

---
class: inverse
background-image: url(../imgs/topic/real_inf.png)
background-size: 100%
background-position: center

---
class: inverse
background-image: url(../imgs/topic/real_inf_hand.png)
background-size: 100%
background-position: center

---
class: inverse, middle, center

# Topic models lab

<font color="red" style='font-size:40pt'> </font>

---
class: inverse, middle, center

# Exploratory data analysis

---
class: inverse
background-image: url(../imgs/topic/book.png)
background-size: 40%
background-position: center

# A good book

.footnote[]

---
class: inverse

## Steps in an EDA

.huge[
.pull-left[
- Read in data
- Figure out what it is
- Pre-process it
- Look at dimensions
- Look at values (str)
]

.pull-right[
- Make tables
- Hunt for messed up values
- Hunt for NAs
- Plot it
- Don't fool yourself
]
]

---
class: inverse
background-image: url(../imgs/topic/vc.png)
background-size: 60%
background-position: center

# Example

---
class: inverse
background-image: url(../imgs/topic/elephant.png)
background-size: 80%
background-position: center

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/nature.png)
background-size: 80%
background-position: center

---
class: inverse

## simplystats data - what is it?

```r
tdir = file.path(tempdir(), "ss")
x = git2r::clone(url = "",
                 local_path = tdir, progress = FALSE)
posts = file.path(tdir, "_posts")
library(tm)
suppressPackageStartupMessages({library(dplyr)})
ds = DirSource(posts)
simply = VCorpus(ds)
class(simply); length(simply)
```

```
[1] "VCorpus" "Corpus"
```

```
[1] 965
```

```r
str(simply[[1]])
```

```
List of 2
 $ content: chr [1:14] "---" "title: Example post" "author: jeff" "layout: post" ...
 $ meta   :List of 7
  ..$ author       : chr(0)
  ..$ datetimestamp: POSIXlt[1:1], format: "2017-09-20 21:18:03"
  ..$ description  : chr(0)
  ..$ heading      : chr(0)
  ..$ id           : chr ""
  ..$ language     : chr "en"
  ..$ origin       : chr(0)
  ..- attr(*, "class")= chr "TextDocumentMeta"
 - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
```

---
class: inverse

## simplystats data - preprocess

```r
library(tidytext)
tidy_simply = simply %>% tidy %>%
  unnest_tokens(word, text) %>%
  select(author, datetimestamp, id, word)
dim(tidy_simply)
```

```
[1] 540717      4
```

```r
str(tidy_simply)
```

```
Classes 'tbl_df', 'tbl' and 'data.frame':	540717 obs. of  4 variables:
 $ author       : logi  NA NA NA NA NA NA ...
 $ datetimestamp: POSIXct, format: "2017-09-20 17:18:03" "2017-09-20 17:18:03" ...
 $ id           : chr  "" "" "" "" ...
 $ word         : chr  "title" "example" "post" "author" ...
```

---
class: inverse

## simplystats data - look at dimensions and values

```r
tidy_simply %>% head(2)
```
```
  author       datetimestamp                                id   word
1     NA 2017-09-20 17:18:03                                  title
2     NA 2017-09-20 17:18:03                               example
```
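
The first rows already hint at trouble: `author` is `NA` and `id` looks empty. A minimal base-R sketch of the "Hunt for NAs" step follows. It uses a small made-up data frame (`toy` is hypothetical and only mimics the columns above, it is not the simplystats data).

```r
# Hypothetical toy data mimicking the columns above (not the real corpus)
toy = data.frame(
  author = c(NA, NA, NA),
  id     = c("", "", "post-1"),
  word   = c("title", "example", "post"),
  stringsAsFactors = FALSE
)

# Count missing values in every column
na_counts = colSums(is.na(toy))

# Fraction of empty strings in every column (NAs are not counted as empty)
empty_frac = sapply(toy, function(col) mean(!is.na(col) & col == ""))

na_counts
empty_frac
```

Empty strings are checked separately because `is.na()` does not flag them, so a column can look complete while carrying no information.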
```r
tidy_simply %>% tail(2)
```
```
       author       datetimestamp                                id  word
540716     NA 2017-09-20 17:18:04                              after
540717     NA 2017-09-20 17:18:04                               1990
```
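
The first and last rows carry timestamps only one second apart, which may mean the column records when the files were read rather than when the posts were written. A rough base-R sketch of checking how much a column varies, again on made-up data (`toy` and its values are assumptions chosen to resemble the output above):

```r
# Hypothetical toy data: timestamps one second apart, as in head/tail above
toy = data.frame(
  datetimestamp = as.POSIXct(c("2017-09-20 17:18:03", "2017-09-20 17:18:03",
                               "2017-09-20 17:18:04"), tz = "UTC"),
  word = c("title", "example", "1990"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column: a quick flag for uninformative columns
n_distinct_vals = sapply(toy, function(col) length(unique(col)))

# Time span covered by the timestamps, in seconds
span_secs = as.numeric(diff(range(toy$datetimestamp)), units = "secs")

n_distinct_vals
span_secs
```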
```r
tidy_simply %>% group_by(word) %>% count()
```
```
# A tibble: 28,039 x 2
# Groups:   word [28,039]
   word              n
   <chr>         <int>
 1 __                5
 2 ___              15
 3 _____             3
 4 __abstract__      1
 5 __choose          1
 6 __comparisons     1
 7 __conclusions__   1
 8 __data            4
 9 __give            1
10 __include         1
# ... with 28,029 more rows
```
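
The top of the count table is full of tokens such as `__` and `__abstract__`, which look like markdown emphasis markers that survived tokenizing. A minimal sketch of the "Hunt for messed up values" step, assuming underscores at the start or end of a token are debris (the `words` vector is hypothetical, modeled on the rows above):

```r
# Hypothetical tokens, modeled on the first rows of the table above
words = c("__", "___", "__abstract__", "__data", "data", "the")

# Flag tokens that start or end with an underscore
messy = grepl("^_|_$", words)

# Strip leading/trailing underscores, then drop tokens that become empty
cleaned = gsub("^_+|_+$", "", words)
cleaned = cleaned[cleaned != ""]

# Re-count after cleaning: "__data" and "data" now fall in the same row
table(cleaned)
```

Whether to strip or drop such tokens is a judgment call; the point is to look at them before trusting the counts.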
```r
most_freq = tidy_simply %>% group_by(word) %>% count() %>% arrange(desc(n))
head(most_freq)
```
```
# A tibble: 6 x 2
# Groups:   word [6]
  word        n
  <chr>   <int>
1 the     20936
2 a       14890
3 to      11626
4 of      11166
5 and      9052
6 in       7234
```
```
4 post   2626
5 span   2408
6 href   2196
```

---
class: inverse

## Steps in an EDA

.huge[
.pull-left[
- Read in data
- Figure out what it is
- Pre-process it
- Look at dimensions
- Look at values (str)
]

.pull-right[
- Make tables
- Hunt for messed up values
- Hunt for NAs
- <font color="red">Plot it</font>
- <font color="red">Don't fool yourself</font>
]
]

---
class: inverse
background-image: url(../imgs/topic/quartet.png)
background-size: 70%
background-position: center

# Why plot

.footnote['s_quartet]

---
class: inverse

## Characteristics of exploratory plots

.superduper[
- They are made quickly <br>
- A large number are made <br>
- The goal is for personal understanding <br>
- Axes/legends are generally cleaned up <br>
- Color/size are primarily used for information
]

---
class: inverse

## EDA

.superduper[
- EDA is part statistics, part psychology <br>
- Unfortunately, we (humans) are designed to find patterns even when there aren't any <br>
- Visual perception is biased by your humanness <br>
- The key goal in EDA is to not trick yourself
]

---
class: inverse
background-image: url(../imgs/topic/optical.png)
background-size: 70%
background-position: center

# Optical illusions teach us about plotting

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/optical_3d.png)
background-size: 60%
background-position: center

# Optical illusions teach us about plotting

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/ggplot.png)
background-size: 50%
background-position: center

# Plots can be thought of as test statistics

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/tasks.png)
background-size: 45%
background-position: center

# Background perceptual tasks

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/pos_length.png)
background-size: 90%
background-position: center

# Position vs. length

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/pos_length_results.png)
background-size: 80%
background-position: center

# Position vs. length - results

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/pos_angle.png)
background-size: 75%
background-position: center

# Position vs. angle

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/pos_angle_results.png)
background-size: 90%
background-position: center

# Position vs. angle - results

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/slopes.png)
background-size: 50%
background-position: center

# The worst - maybe slopes?

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/scale.png)
background-size: 80%
background-position: center

# Scale matters

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/corr.png)
background-size: 50%
background-position: center

# People perceive correlations weirdly

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/linear.png)
background-size: 80%
background-position: center

# Detecting even linear relationships

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/sens_spec.png)
background-size: 80%
background-position: center

# People are bad at significance in plots

.footnote[]

---
class: inverse

## Summary

.huge[
* Use common scales when possible
* When possible use position comparisons
* Angle comparisons are hard to interpret (no pie charts!)
* No 3-D bar charts
* Be careful not to "fool" yourself about significance (either way)
]

---
class: inverse

## RSkittleBrewer

<img src="../imgs/topic/skittle.png" style="height:70%">

```r
devtools::install_github('alyssafrazee/RSkittleBrewer')
```

```r
trop = RSkittleBrewer::RSkittleBrewer("tropical")
palette(trop)
par(pch=19)
```

---
class: inverse

## simplystats data - one d

```r
hist(most_freq$n, col = 2); hist(log2(most_freq$n + 1), col = 2)
```

---
class: inverse

## simplystats data - one d

```r
boxplot(log2(most_freq$n + 1), col = 2)
```

---
class: inverse

## simplystats data - one d

```r
nn = dim(most_freq)[1]
plot(rep(1,nn), log2(most_freq$n + 1), col = 2)
plot(jitter(rep(1,nn)), log2(most_freq$n + 1),
     col = 2, xlim = c(0.5,1.5))
```

---
class: inverse

<img src="../imgs/topic/color_picker.png" style="width:50%">

```r
plot(jitter(rep(1,nn)), log2(most_freq$n+1),
     col="#00000010", xlim=c(0.5,1.5))
```

---
class: inverse
background-image: url(../imgs/topic/wages.png)
background-size: 70%
background-position: center

# Example: wage data

.footnote[,<br>]

---
class: inverse

## Colors for confounders

```r
library(ISLR)
suppressPackageStartupMessages({library(ggplot2)})
p = ggplot(aes(age,wage), data = Wage)
p + geom_point(); p + geom_point(aes(colour = jobclass))
```

---
class: inverse

## Smooth scatter for big data

```r
smoothScatter(Wage$age,Wage$wage)
```

---
class: inverse

## Smooth scatter for big data

```r
library(rafalib); splot(Wage$age,Wage$wage)
```

---
class: inverse

## Aside: graphics devices

.super[
* <font color="red">**pdf**</font>: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects/points
* <font color="red">**png**</font>: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well
* <font color="red">**jpeg**</font>: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
]

---
class: inverse

## Models as visual aid

```r
qq = qplot(age,wage,colour=education,data=Wage)
qq + geom_smooth(method='lm',formula=y~x)
```

---
class: inverse

## Models as visual aid

```r
library(modelr); mod1 = lm(wage ~ age*education, data=Wage)
Wage1 = Wage %>% add_predictions(mod1)
qq <- qplot(age,wage,colour=education,data=Wage1)
qq + geom_point(aes(age,pred),colour="black")
```

---
class: inverse

## Residuals, colored

```r
Wage1 = Wage1 %>% mutate(resid=wage-pred)
qplot(age,resid,colour=education,data=Wage1)
```

---
class: inverse

## Factors for plotting groups

```r
suppressPackageStartupMessages({library(Hmisc)})
cutWage <- cut2(Wage$wage,g=3); table(cutWage,Wage$jobclass)
```

```
               
cutWage         1. Industrial 2. Information
  [ 20.1, 92.2)           629            371
  [ 92.2,118.9)           533            507
  [118.9,318.3]           382            578
```

---
class: inverse

## Factors for plotting groups

```r
p1 = qplot(cutWage,age, data=Wage,fill=cutWage,
      geom=c("boxplot")); p1
```

---
class: inverse

## Overlay points

```r
p2 = qplot(cutWage,age, data=Wage,fill=cutWage,
      geom=c("boxplot","jitter"))
gridExtra::grid.arrange(p1,p2,ncol=2)
```

---
class: inverse

## Heatmaps

```r
library(pheatmap); head(USArrests, 3)
```

```
        Murder Assault UrbanPop Rape
Alabama   13.2     236       58 21.2
Alaska    10.0     263       48 44.5
Arizona    8.1     294       80 31.0
```

```r
args(pheatmap)
```

```
function (mat, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(100), 
    kmeans_k = NA, breaks = NA, border_color = "grey60", cellwidth = NA, 
    cellheight = NA, scale = "none", cluster_rows = TRUE, cluster_cols = TRUE, 
    clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean", 
    clustering_method = "complete", clustering_callback = identity2, 
    cutree_rows = NA, cutree_cols = NA, treeheight_row = ifelse((class(cluster_rows) == 
        "hclust") || cluster_rows, 50, 0), treeheight_col = ifelse((class(cluster_cols) == 
        "hclust") || cluster_cols, 50, 0), legend = TRUE, legend_breaks = NA, 
    legend_labels = NA, annotation_row = NA, annotation_col = NA, 
    annotation = NA, annotation_colors = NA, annotation_legend = TRUE, 
    annotation_names_row = TRUE, annotation_names_col = TRUE, 
    drop_levels = TRUE, show_rownames = T, show_colnames = T, 
    main = NA, fontsize = 10, fontsize_row = fontsize, fontsize_col = fontsize, 
    display_numbers = F, number_format = "%.2f", number_color = "grey30", 
    fontsize_number = 0.8 * fontsize, gaps_row = NULL, gaps_col = NULL, 
    labels_row = NULL, labels_col = NULL, filename = NA, width = NA, 
    height = NA, silent = FALSE, ...) 
NULL
```

---
class: inverse

## Heatmaps

```r
pheatmap(USArrests)
```

---
class: inverse
background-image: url(../imgs/topic/ggplot2_no.png)
background-size: 70%
background-position: center

.footnote[]

---
class: inverse
background-image: url(../imgs/topic/ggplot2_yes.png)
background-size: 80%
background-position: center

.footnote[]
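
---
class: inverse

## Heatmaps - watch the scale

The columns of `USArrests` are on quite different scales, so an unscaled heatmap may be driven mostly by the column with the largest spread. Below is a minimal base-R sketch of standardizing the columns first. The commented `pheatmap()` call assumes the package is installed and uses the `scale` argument listed in the `args(pheatmap)` output; the variable names are illustrative only.

```r
# Column standard deviations differ a lot, so one column can dominate
col_sds = apply(USArrests, 2, sd)

# Standardize each column to mean 0 and standard deviation 1
scaled = scale(USArrests)

round(col_sds, 1)
round(colMeans(scaled), 10)   # numerically zero after centering

# With pheatmap installed, the same idea is available directly:
# pheatmap(USArrests, scale = "column")
```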