Exploratory analysis

Jeffrey Leek
Johns Hopkins Bloomberg School of Public Health

Pro tip

It is a weird time in academics. There used to be one track to being well-known

Publish good papers -> "Fame"

Now there are a lot of other possible routes

Blogs, Twitter, Github, R packages, Publish good papers -> "Fame"

I like that there is variety. If you want to be taken seriously one absolutely critical thing is to have a portfolio of serious work (good R packages, good papers, blog posts that are basicly science papers).

Papers Blog posts of the day

A good book

EDA - plotting

  • During EDA you will often have the choice of performing a statistical test/summary or making a plot. Make the plot.
  • This doesn't mean don't think!
  • Choosing how to make plots and using them to convince yourself/others that trends are real is an important skill.

Why plot

Characteristics of exploratory graphs

  • They are made quickly
  • A large number are made
  • The goal is for personal understanding
  • Axes/legends are generally cleaned up
  • Color/size are primarily used for information

EDA

  • EDA is part statistics, part psychology
  • Ufortunately we (humans) are designed to find patterns even when there aren't any
  • Visual perception is biased by your humanness.
  • The key goal in exploratory EDA is to not trick yourself

What optical illusions tell us about EDA

What optical illusions tell us about EDA

Plots can be thought of as test statistics

Background - perceptual tasks

Position versus length

Position versus length - results

Position versus angle

Position versus angle - results

In general people are bad at slopes

More experimental results

Scale matters

People perceive correlations weirdly

Detecting significance

Fisher, Anderson, Leek (in prep.)

People are bad at this

Summary

  • Use common scales when possible
  • When possible use position comparisons
  • Angle comparisons are frequently hard to interpret (no piecharts!)
  • No 3-D barcharts
  • Be careful not to "fool" yourself about significance (either way)

The most important plots

  • plot - scatterplot
    • ask,mar,mfrow,cex,cex.axis(etc.),col,pch,lwd,lty,las
  • qqplot - Quantile-quantile plot
    • use abline(c(0,1))
  • boxplot boxplots for distributions
    • varwidth, col
  • hist for distributions (multiple)
    • varwdith
  • image, heatmap - multivariate data
    • image(t(dat)[,nrow(dat):1])
    • x,y,z,zlim,col,breaks
  • stripchart - like boxplot but with points
    • jitter, metohd, at, add,vertical

What is a Graphics Device?

  • A graphics device is something where you can make a plot appear

    • A window on your computer (screen device)
    • A PDF file (file device)
    • A PNG or JPEG file (file device)
    • A scalable vector graphics (SVG) file (file device)
  • When you make a plot in R, it has to be "sent" to a specific graphics device

  • The most common place for a plot to be "sent" is the screen device

    • On a Mac the screen device is launched with the quartz()
    • On Windows the screen device is launched with windows()
    • On Unix/Linux the screen device is launched with x11()

How Does a Plot Get Created?

The second approach to plotting is most commonly used for file devices:

  1. Explicitly launch a graphics device

  2. Call a plotting function to make a plot (Note: if you are using a file device, no plot will appear on the screen)

  3. Annotate plot if necessary

  4. Explicitly close graphics device with dev.off() (this is very important!)

pdf(file = "myplot.pdf")  ## Open PDF device; create 'myplot.pdf' in my working directory
## Create plot and send to a file (no plot appears on screen)
with(faithful, plot(eruptions, waiting))  
title(main = "Old Faithful Geyser data")  ## Annotate plot; still nothing on screen
dev.off()  ## Close the PDF file device
## Now you can view the file 'myplot.pdf' on your computer

Graphics File Devices

There are two basic types of file devices: vector and bitmap devices

Vector formats:

  • pdf: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects/points

  • svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots

  • win.metafile: Windows metafile format (only on Windows)

  • postscript: older format, also resizes well, usually portable, can be used to create encapsulated postscript files; Windows systems often don’t have a postscript viewer

Graphics File Devices

Bitmap formats

  • png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many many many points, does not resize well

  • jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings

  • tiff: Creates bitmap files in the TIFF format; supports lossless compression

  • bmp: a native Windows bitmapped format

Copying Plots

Copying a plot to another device can be useful because some plots require a lot of code and it can be a pain to type all that in again for a different device.

  • dev.copy: copy a plot from one device to another

  • dev.copy2pdf: specifically copy a plot to a PDF file

NOTE: Copying a plot is not an exact operation, so the result may not be identical to the original.

library(datasets)
with(faithful, plot(eruptions, waiting))  ## Create plot on screen device
title(main = "Old Faithful Geyser data")  ## Add a main title
dev.copy(png, file = "geyserplot.png")  ## Copy my plot to a PNG file
dev.off()  ## Don't forget to close the PNG device!

Summary

  • Plots must be created on a graphics device

  • The default graphics device is almost always the screen device, which is most useful for exploratory analysis

  • File devices are useful for creating plots that can be included in other documents or sent to other people

  • For file devices, there are vector and bitmap formats

    • Vector formats are good for line drawings and plots with solid colors using a modest number of points
    • Bitmap formats are good for plots with a large number of points, natural scenes or web-based plots

Plotting and Color

  • The default color schemes for most plots in R are horrendous
    • I don’t have good taste and even I know that
  • Recently there have been developments to improve the handling/specifica1on of colors in plots/graphs/etc.
  • There are functions in R and in external packages that are very handy

Colors 1, 2, and 3

Default Image Plots in R

Color U1li1es in R

  • The grDevices package has two functions
    • colorRamp
    • colorRampPalette
  • These functions take palettes of colors and help to interpolate between the colors
  • The function colors() lists the names of colors you can use in any plotting function

Color Palette Utilities in R

  • colorRamp: Take a palette of colors and return a function that takes valeus between 0 and 1, indicating the extremes of the color palette (e.g. see the 'gray' function)
  • colorRampPalette: Take a palette of colors and return a function that takes integer arguments and returns a vector of colors interpolating the palette (like heat.colors or topo.colors)

colorRamp

[,1] [,2] [,3] corresponds to [Red] [Blue] [Green]

> pal <- colorRamp(c("red", "blue"))

> pal(0)
     [,1] [,2] [,3]
[1,]  255    0    0

> pal(1)
     [,1] [,2] [,3]
[1,]    0    0  255

> pal(0.5)
      [,1] [,2]  [,3]
[1,] 127.5    0 127.5

colorRamp

> pal(seq(0, 1, len = 10))
                  [,1] [,2]       [,3]
        [1,] 255.00000    0          0
        [2,] 226.66667    0   28.33333
        [3,] 198.33333    0   56.66667
        [4,] 170.00000    0   85.00000
        [5,] 141.66667    0  113.33333
        [6,] 113.33333    0  141.66667
        [7,]  85.00000    0  170.00000
        [8,]  56.66667    0  198.33333
        [9,]  28.33333    0  226.66667
        [10,]  0.00000    0  255.00000

colorRampPalette

> pal <- colorRampPalette(c("red", "yellow"))

> pal(2)
[1] "#FF0000" "#FFFF00"

> pal(10)
 [1] "#FF0000" "#FF1C00" "#FF3800" "#FF5500" "#FF7100"
 [6] "#FF8D00" "#FFAA00" "#FFC600" "#FFE200" "#FFFF00”

RColorBrewer Package

  • One package on CRAN that contains interes1ng/useful color palettes

  • There are 3 types of palettes

    • Sequential
    • Diverging
    • Qualitative
  • Palette informa1on can be used in conjunction with the colorRamp() and colorRampPalette()

RColorBrewer and colorRampPalette

> library(RColorBrewer)

> cols <- brewer.pal(3, "BuGn")

> cols
[1] "#E5F5F9" "#99D8C9" "#2CA25F"

> pal <- colorRampPalette(cols)

> image(volcano, col = pal(20))

RColorBrewer and colorRampPalette

The smoothScatter function

Some other plotting notes

  • The rgb function can be used to produce any color via red, green, blue proportions
  • Color transparency can be added via the alpha parameter to rgb
  • The colorspace package can be used for a different control over colors

Scatterplot with no transparency

Scatterplot with transparency

Summary

  • Careful use of colors in plots/maps/etc. can make it easier for the reader to get what you're trying to say (why make it harder?)
  • The RColorBrewer package is an R package that provides color palettes for sequential, categorical, and diverging data
  • The colorRamp and colorRampPalette functions can be used in conjunction with color palettes to connect data to colors
  • Transparency can sometimes be used to clarify plots with many points

What is ggplot2?

  • An implementation of The Grammar of Graphics by Leland Wilkinson
  • Written by Hadley Wickham (while he was a graduate student at Iowa State)
  • A “third” graphics system for R (along with base and lattice)
  • Available from CRAN via install.packages()
  • Web site: http://ggplot2.org (better documentation)

What is ggplot2?

  • Grammar of graphics represents an abstraction of graphics ideas/objects
  • Think “verb”, “noun”, “adjective” for graphics
  • Allows for a “theory” of graphics on which to build new graphics and graphics objects
  • “Shorten the distance from mind to page”

Grammer of Graphics

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”

  • from ggplot2 book

Plotting Systems in R: Base

  • “Artist’s palette” model
  • Start with blank canvas and build up from there
  • Start with plot function (or similar)
  • Use annotation functions to add/modify (text, lines, points, axis)

Plotting Systems in R: Base

  • Convenient, mirrors how we think of building plots and analyzing data
  • Can’t go back once plot has started (i.e. to adjust margins); need to plan in advance
  • Difficult to “translate” to others once a new plot has been created (no graphical “language”)
    • Plot is just a series of R commands

Plotting Systems in R: Lattice

  • Plots are created with a single function call (xyplot, bwplot, etc.)
  • Most useful for conditioning types of plots: Looking at how \(y\) changes with \(x\) across levels of \(z\)
  • Things like margins/spacing set automatically because entire plot is specified at once
  • Good for putting many many plots on a screen

Plotting Systems in R: Lattice

  • Sometimes awkward to specify an entire plot in a single function call
  • Annotation in plot is not intuitive
  • Use of panel functions and subscripts difficult to wield and requires intense preparation
  • Cannot “add” to the plot once it’s created

Plotting Systems in R: ggplot2

  • Split the difference between base and lattice
  • Automatically deals with spacings, text, titles but also allows you to annotate by “adding”
  • Superficial similarity to lattice but generally easier/more intuitive to use
  • Default mode makes many choices for you (but you can customize!)

The Basics: qplot()

  • Works much like the plot function in base graphics system
  • Looks for data in a data frame, similar to lattice, or in the parent environment
  • Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)

The Basics: qplot()

  • Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled
  • The qplot() hides what goes on underneath, which is okay for most operations
  • ggplot() is the core function and very flexible for doing things qplot() cannot do

Example Dataset

library(ggplot2)
str(mpg)
'data.frame':   234 obs. of  11 variables:
 $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
 $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
 $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
 $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

ggplot2 “Hello, world!”

qplot(displ, hwy, data = mpg)
plot of chunk unnamed-chunk-4

Modifying aesthetics

qplot(displ, hwy, data = mpg, color = drv)
plot of chunk unnamed-chunk-5

Adding a geom

qplot(displ, hwy, data = mpg, geom = c("point", "smooth"))
plot of chunk unnamed-chunk-6

Histograms

qplot(hwy, data = mpg, fill = drv)
plot of chunk unnamed-chunk-7

Facets

qplot(displ, hwy, data = mpg, facets = . ~ drv)
plot of chunk unnamed-chunk-8
qplot(hwy, data = mpg, facets = drv ~ ., binwidth = 2)
plot of chunk unnamed-chunk-8

Summary of qplot()

  • The qplot() function is the analog to plot() but with many built-in features
  • Syntax somewhere in between base/lattice
  • Produces very nice graphics, essentially publication ready (if you like the design)
  • Difficult to go against the grain/customize (don’t bother; use full ggplot2 power in that case)

Resources

  • The ggplot2 book by Hadley Wickham
  • The R Graphics Cookbook by Winston Chang (examples in base plots and in ggplot2)
  • ggplot2 web site (http://ggplot2.org)
  • ggplot2 mailing list (http://goo.gl/OdW3uB), primarily for developers

Why do we use graphs in data analysis?

  • To understand data properties
  • To find patterns in data
  • To suggest modeling strategies
  • To "debug" analyses
  • To communicate results

Expository graphs

  • To understand data properties
  • To find patterns in data
  • To suggest modeling strategies
  • To "debug" analyses
  • To communicate results

Characteristics of expository graphs

  • The goal is to communicate information
  • Information density is generally good
  • Color/size are used both for aesthetics and communication
  • Expository figures have understandable axes, titles, and legends

Housing data


pData <- read.csv("./data/ss06pid.csv")

Axes

Important parameters: xlab,ylab,cex.lab,cex.axis

plot(pData$JWMNP,pData$WAGP,pch=19,col="blue",cex=0.5,
     xlab="Travel time (min)",ylab="Last 12 month wages (dollars)")
plot of chunk unnamed-chunk-9

Axes

plot(pData$JWMNP,pData$WAGP,pch=19,col="blue",cex=0.5,
     xlab="Travel time (min)",ylab="Last 12 month wages (dollars)",cex.lab=2,cex.axis=1.5)
plot of chunk unnamed-chunk-10

Legends

  • Important paramters: x,y,legend, other plotting parameters
plot(pData$JWMNP,pData$WAGP,pch=19,col="blue",cex=0.5,xlab="TT (min)",ylab="Wages (dollars)")
legend(100,200000,legend="All surveyed",col="blue",pch=19,cex=0.5)
plot of chunk unnamed-chunk-11

Legends

plot(pData$JWMNP,pData$WAGP,pch=19,cex=0.5,xlab="TT (min)",ylab="Wages (dollars)",col=pData$SEX)
legend(100,200000,legend=c("men","women"),col=c("black","red"),pch=c(19,19),cex=c(0.5,0.5))
plot of chunk unnamed-chunk-12

Titles

plot(pData$JWMNP,pData$WAGP,pch=19,cex=0.5,xlab="CT (min)",
     ylab="Wages (dollars)",col=pData$SEX,main="Wages earned versus commute time")
legend(100,200000,legend=c("men","women"),col=c("black","red"),pch=c(19,19),cex=c(0.5,0.5))
plot of chunk unnamed-chunk-13

Multiple panels

par(mfrow=c(1,2))
hist(pData$JWMNP,xlab="CT (min)",col="blue",breaks=100,main="")
plot(pData$JWMNP,pData$WAGP,pch=19,cex=0.5,xlab="CT (min)",ylab="Wages (dollars)",col=pData$SEX)
legend(100,200000,legend=c("men","women"),col=c("black","red"),pch=c(19,19),cex=c(0.5,0.5))
plot of chunk unnamed-chunk-14

Adding text

par(mfrow=c(1,2))
hist(pData$JWMNP,xlab="CT (min)",col="blue",breaks=100,main="")
mtext(text="(a)",side=3,line=1)
plot(pData$JWMNP,pData$WAGP,pch=19,cex=0.5,xlab="CT (min)",ylab="Wages (dollars)",col=pData$SEX)
legend(100,200000,legend=c("men","women"),col=c("black","red"),pch=c(19,19),cex=c(0.5,0.5))
mtext(text="(b)",side=3,line=1)
plot of chunk unnamed-chunk-15

Figure captions

plot of chunk unnamed-chunk-16

Figure 1. Distribution of commute time and relationship to wage earned by sex (a) Commute times in the American Community Survey (ACS) are right skewed. (b) Commute times do not appear to be strongly correlated with wage for either sex.

Colorblindness

Graphical workflow

  • Start with a rough plot
  • Tweak it to make it expository
  • Save the file
  • Include it in presentations

Saving files in R is done with graphics devices. Use the command ?Devices to see a list. Here we will go over the most popular devices.

pdf

  • Important parameters: file, height,width
pdf(file="twoPanel.pdf",height=4,width=8)
par(mfrow=c(1,2))
hist(pData$JWMNP,xlab="CT (min)",col="blue",breaks=100,main="")
mtext(text="(a)",side=3,line=1)
plot(pData$JWMNP,pData$WAGP,pch=19,cex=0.5,xlab="CT (min)",ylab="Wages (dollars)",col=pData$SEX)
legend(100,200000,legend=c("men","women"),col=c("black","red"),pch=c(19,19),cex=c(0.5,0.5))
mtext(text="(b)",side=3,line=1)

dev.off()
pdf 
  2 

png

  • Important parameters: file, height,width
png(file="twoPanel.png",height=480,width=(2*480))
par(mfrow=c(1,2))
hist(pData$JWMNP,xlab="CT (min)",col="blue",breaks=100,main="")
mtext(text="(a)",side=3,line=1)
plot(pData$JWMNP,pData$WAGP,pch=19,cex=0.5,xlab="CT (min)",ylab="Wages (dollars)",col=pData$SEX)
legend(100,200000,legend=c("men","women"),col=c("black","red"),pch=c(19,19),cex=c(0.5,0.5))
mtext(text="(b)",side=3,line=1)
dev.off()
pdf 
  2 

dev.copy2pdf

par(mfrow=c(1,2))
hist(pData$JWMNP,xlab="CT (min)",col="blue",breaks=100,main="")
plot(pData$JWMNP,pData$WAGP,pch=19,cex=0.5,xlab="CT (min)",ylab="Wages (dollars)",col=pData$SEX)
plot of chunk unnamed-chunk-19
dev.copy2pdf(file="twoPanelv2.pdf")
pdf 
  2 

Something to avoid

Something to aspire to

Further resources