class: center, middle, inverse, title-slide

# Getting data
## JHU Data Science
### www.jtleek.com/advdatasci

---

class: center, middle

# What you wish data looked like

![](../imgs/tidy-data-example.png)

---
class: center, middle
# What it actually looks like

![](../imgs/fastq.png)

---
class: center, middle
# What it actually looks like

https://dev.twitter.com/docs/api/1/get/blocks/blocking

![](../imgs/twitter-api.png)

---
class: center, middle
# What it actually looks like

http://healthdesignchallenge.com/

![](../imgs/ehr.png)

---
class: center, middle
# Spreadsheet tales

![](../imgs/spreadsheet-tales.png)

---
class: center, middle
# It gets very crazy

https://github.com/jennybc/2016-06_spreadsheets/blob/master/2016-06_useR-stanford.pdf

![](../imgs/enron-spreadsheet.png)

---
class: center, middle
# Other people's data

[#otherpeoplesdata](https://twitter.com/search?q=%23otherpeoplesdata)

![](../imgs/otherpeoples-data.png)

---
class: center, middle
# Where you wish data were

![](../imgs/your-laptop.png)

---
class: center, middle
# Where they actually are

![](../imgs/databases.png)

---
class: center, middle
# Where they actually are

https://dev.twitter.com/docs/api/1/get/blocks/blocking

![](../imgs/twitter-api.png)

---
class: center, middle
# Where they actually are

https://data.baltimorecity.gov

![](../imgs/bmore-data.png)

---
class: center, middle
# Our plan

![](../imgs/all-data.png)

---
class: center, middle

# Data brainstorming

https://goo.gl/9j3T7y

---
class: center, middle
# What are data?

http://en.wikipedia.org/wiki/Data

Data are values of qualitative or quantitative variables, belonging to a set of items.

---
class: center, middle
# Relativity of raw data

https://simplystatistics.org/2016/07/20/relativity-raw-data/

...raw data is raw to you if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different "raw data set".

---
class: center, middle
# The relativity of raw data - example

![](../imgs/hiseq.png)

---
class: center, middle
# The relativity of raw data - example

![](../imgs/hiseq-workflow.png)

---
class: center, middle

# How to share data

![](../imgs/ellis-datashare.png)

---
# The four parts

1. The raw data.
2. A tidy data set.
3. A code book describing each variable and its values in the tidy data set.
4. An explicit and exact recipe you used to go from 1 -> 2, 3.
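
---
# The four parts: a sketch

A minimal sketch of how the four parts might fit together in a single script. The data, the variable names, and the output folder are all hypothetical; the point is only that the script itself serves as the recipe that turns the raw data into the tidy data and the code book.

```r
# 1. Hypothetical raw data: one row per subject, one column per visit
raw <- data.frame(
  subject          = c("s1", "s2"),
  visit1_weight_kg = c(70.1, 82.4),
  visit2_weight_kg = c(69.5, 83.0)
)

# 2. Tidy data: one row per observation (subject x visit)
tidy <- data.frame(
  subject   = rep(raw$subject, times = 2),
  visit     = rep(1:2, each = nrow(raw)),
  weight_kg = c(raw$visit1_weight_kg, raw$visit2_weight_kg)
)

# 3. Code book: one row per variable in the tidy data
codebook <- data.frame(
  variable    = names(tidy),
  description = c("Subject identifier", "Visit number", "Body weight"),
  units       = c(NA, NA, "kg")
)

# 4. The recipe is this script; write all three pieces side by side
out_dir <- file.path(tempdir(), "shared-data")  # assumed location
dir.create(out_dir, showWarnings = FALSE)
write.csv(raw,      file.path(out_dir, "raw.csv"),      row.names = FALSE)
write.csv(tidy,     file.path(out_dir, "tidy.csv"),     row.names = FALSE)
write.csv(codebook, file.path(out_dir, "codebook.csv"), row.names = FALSE)
list.files(out_dir)
```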

---
class: center, middle

# Raw data

![](../imgs/raw-data.png)

---
class: center, middle

# Tidy data

![](../imgs/tidy-data.png)

---
class: center, middle

# Code book

![](../imgs/code-book.png)

---
class: center, middle

# Recipe

![](../imgs/recipe-best.png)

---
class: center, middle

# Recipe

![](../imgs/recipe-meh.png)

---
class: center, middle

# Recipe

![](../imgs/recipe-no.png)

---
class: center, middle

# Getting data

---
# Relative versus absolute paths

.pull-left[
__Do__

```r
setwd("../data")
setwd("./files")
setwd("..\\tmp")
```

]
.pull-right[
__Don't__

```r
setwd("/Users/jtleek/data")
setwd("~/Desktop/files/data")
setwd("C:\\Users\\Andrew\\Downloads")
```
]
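
---
# Relative paths with `file.path()`

One way to avoid typing path separators by hand is to build relative paths from their components with base R's `file.path()`. This is a small sketch; the folder and file names are placeholders.

```r
# Build relative paths from components; file.path() joins them with "/"
data_dir <- file.path("..", "data")
tmp_file <- file.path("..", "tmp", "cameras.csv")

data_dir
tmp_file

# Relative paths are resolved against the current working directory
getwd()
```

Because the separator is never written out, the same lines can be shared between collaborators on different operating systems.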

---
# Finding and creating files

```r
if(!file.exists("data")){
  dir.create("data")
}
list.files("data")
```

---
# Downloading data

```r
file_url <- paste0("https://data.baltimorecity.gov/api/",
 "views/dz54-2aru/rows.csv?accessType=DOWNLOAD")

download.file(file_url,
 destfile="cameras.csv")
list.files(".")
date_downloaded <- date()
date_downloaded
```
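
---
# Downloading data only once

A possible refinement, sketched under assumptions: wrap the download in a helper that skips files already on disk and stores the download date next to the file. `download_once()` and the `.downloaded` suffix are made-up names for illustration, not part of base R.

```r
# Hypothetical helper: fetch a file only if it is not already present
download_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)  # nothing downloaded, existing copy kept
  }
  download.file(url, destfile = destfile, mode = "wb")
  # Keep the download date in a small text file beside the data
  writeLines(date(), paste0(destfile, ".downloaded"))
  TRUE
}
```

A call such as `download_once(file_url, file.path("data", "cameras.csv"))` would then be safe to rerun, since a second call returns `FALSE` without contacting the server.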

---
class: inverse, center, middle

# Google Sheets

---
background-image: url(../imgs/getting_data/gsheet.png)
background-size: 80% 
background-position: center

# https://docs.google.com/spreadsheets

---
background-image: url(../imgs/getting_data/gsheets_jenny.png)
background-size: 80% 
background-position: bottom

# https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015

---
class: inverse, middle

<code class="r">install.packages("googlesheets")</code>

<code class="r">library(googlesheets)</code>

<code class="r">?gs_read</code>
 
<code class="r">?"cell-specification"</code>


---
background-image: url(../imgs/getting_data/sheet1.png)
background-size: 100% 
background-position: center

# https://docs.google.com/spreadsheets/d/18KQQd4LY5k8Ucux1MvWCsQGQJlvd0ECTnn-3ixdOKFM/pubhtml

---
background-image: url(../imgs/getting_data/sheet2.png)
background-size: 100% 
background-position: center

# Publish to the web

---
class: inverse, middle

<code class="r">sheets_url = paste0("https://docs.google.com/spreadsheets",</code>
<code class="r">  "/d/18KQQd4LY5k8Ucux1MvWCsQGQJlvd0ECTnn-3ixdOKFM/pubhtml")</code>
 
 
<code class="r">gsurl1 = gs_url(sheets_url)</code>
 
<code class="r">dat = gs_read(gsurl1)</code>


---
class: inverse, middle, center

# JSON

---
background-image: url(../imgs/getting_data/json.png)
background-size: 100% 
background-position: center

# https://en.wikipedia.org/wiki/JSON

---
background-image: url(../imgs/getting_data/json.png)
background-size: 100% 
background-position: center

# Why JSON matters

https://developer.github.com/v3/search/

---
# Reading in JSON: jsonlite

```r
github_url = "https://api.github.com/users/jtleek/repos"

install.packages("jsonlite")
library(jsonlite)
jsonData <- fromJSON(github_url)
dim(jsonData)
jsonData$name
```

---

# Data frame structure from JSON

```r
table(sapply(jsonData,class))

dim(jsonData$owner)

names(jsonData$owner)
```
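
---
# Nested JSON: a small offline example

The same structure can be explored without a network call. The JSON text below is invented for illustration and only mimics the shape of an API response: each record carries a nested `owner` object, which `fromJSON()` turns into a data frame stored inside a column. Passing `flatten = TRUE` is one way to spread it into ordinary columns.

```r
library(jsonlite)

# Hypothetical JSON, shaped like a trimmed-down API response
txt <- '[
  {"name": "repo-a", "stars": 3, "owner": {"login": "alice", "id": 1}},
  {"name": "repo-b", "stars": 5, "owner": {"login": "bob",   "id": 2}}
]'

# Default parsing: the nested object becomes a data frame column
nested <- fromJSON(txt)
sapply(nested, class)
nested$owner$login

# flatten = TRUE expands nested data frames into prefixed columns
flat <- fromJSON(txt, flatten = TRUE)
names(flat)
```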

---
class: inverse, middle, center

# Web Scraping

---
background-image: url(../imgs/getting_data/this_is_data.png)
background-size: 80% 
background-position: bottom

# This is data.right[http://bowtie-bio.sourceforge.net/recount/]

---
background-image: url(../imgs/getting_data/view_source.png)
background-size: 80% 
background-position: center

# View Source

---
background-image: url(../imgs/getting_data/computer_sees.png)
background-size: 80% 
background-position: center

# What the computer sees

---
background-image: url(../imgs/getting_data/inspect_element.png)
background-size: 80% 
background-position: center

# Inspect element

---
background-image: url(../imgs/getting_data/copy_xpath.png)
background-size: 80% 
background-position: center

# Copy XPath

---
background-image: url(../imgs/getting_data/selector_gadget.png)
background-size: 80% 
background-position: center

# Selector Gadget

---
background-image: url(../imgs/getting_data/run_selector_gadget.png)
background-size: 80% 
background-position: center

# Selector Gadget

---

## `rvest` package

```r
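recount_url = "http://bowtie-bio.sourceforge.net/recount/"
install.packages("rvest")
library(rvest)
# Download and parse the page
htmlfile = read_html(recount_url)

# Select the table with the XPath copied from the browser
nds = html_nodes(htmlfile,
                 xpath = '//*[@id="recounttab"]/table')
# html_table() on a node set returns a list of tables
dat = html_table(nds)
dat = as.data.frame(dat)
head(dat)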
```
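
---

## `rvest` on an inline page

To try the XPath step without depending on a live site, the same calls can be run on an HTML string. The page below is invented for illustration (the `demo` id and the table contents are placeholders); it is a sketch of the technique, not a copy of the recount page.

```r
library(rvest)

# Hypothetical page: a div with an id wrapping one small table
page <- read_html('
<html><body>
  <div id="demo">
    <table>
      <tr><th>study</th><th>samples</th></tr>
      <tr><td>A</td><td>10</td></tr>
      <tr><td>B</td><td>25</td></tr>
    </table>
  </div>
</body></html>')

# Same pattern as an XPath copied from the browser
nds <- html_nodes(page, xpath = '//*[@id="demo"]/table')
length(nds)

# One table was matched, so take the first element of the list
tab <- as.data.frame(html_table(nds)[[1]])
tab
```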

---
background-image: url(../imgs/getting_data/okcupid.png)
background-size: 80% 
background-position: center

# http://motherboard.vice.com/read/70000-okcupid-users-just-had-their-data-published

---
background-image: url(../imgs/getting_data/guardian.png)
background-size: 80% 
background-position: center

# https://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden

---
class: inverse, center, middle

# APIs

---
background-image: url(../imgs/getting_data/apis.png)
background-size: 80% 
background-position: bottom

# Application Programming Interfaces.right[http://bowtie-bio.sourceforge.net/recount/]

---
background-image: url(../imgs/getting_data/pubmed.png)
background-size: 100% 
background-position: bottom

# In biology too!.right[http://www.ncbi.nlm.nih.gov/books/NBK25501/]

---
background-image: url(../imgs/getting_data/step0.png)
background-size: 80% 
background-position: bottom

# Step 0: Did someone do this already?.right[https://ropensci.org/]

---
background-image: url(../imgs/getting_data/figshare.png)
background-size: 80% 
background-position: bottom

# Figshare.right[https://figshare.com]

---

## Figshare API wrapper

```r
install.packages("rfigshare")
library(rfigshare)
leeksearch = fs_search("Leek")

length(leeksearch)
leeksearch[[1]]
```

---
background-image: url(../imgs/getting_data/diy.png)
background-size: 100% 
background-position: bottom

# Do it yourself

---
background-image: url(../imgs/getting_data/read_docs.png)
background-size: 100% 
background-position: bottom

# Read the docs

---
background-image: url(../imgs/getting_data/api_limit.png)
background-size: 100% 
background-position: center

# Read the docs

---
background-image: url(../imgs/getting_data/example_query.png)
background-size: 100% 
background-position: center

# Example query

---
class: middle

# A dissected example