class: center, middle, inverse, title-slide # Getting data ## JHU Data Science ### www.jtleek.com/advdatasci --- class: center, middle # What you wish data looked like ![](../imgs/tidy-data-example.png) --- class: center, middle # What it actually looks like ![](../imgs/fastq.png) --- class: center, middle # What it actually looks like https://dev.twitter.com/docs/api/1/get/blocks/blocking ![](../imgs/twitter-api.png) --- class: center, middle # What it actually looks like http://healthdesignchallenge.com/ ![](../imgs/ehr.png) --- class: center, middle # Spreadsheet tales ![](../imgs/spreadsheet-tales.png) --- class: center, middle # It gets very crazy https://github.com/jennybc/2016-06_spreadsheets/blob/master/2016-06_useR-stanford.pdf ![](../imgs/enron-spreadsheet.png) --- class: center, middle # Other people's data [#otherpeoplesdata](https://twitter.com/search?q=%23otherpeoplesdata) ![](../imgs/otherpeoples-data.png) --- class: center, middle # Where you wish data were ![](../imgs/your-laptop.png) --- class: center, middle # Where they actually are ![](../imgs/databases.png) --- class: center, middle # What they actually are https://dev.twitter.com/docs/api/1/get/blocks/blocking ![](../imgs/twitter-api.png) --- class: center, middle # What they actually are https://data.baltimorecity.gov ![](../imgs/bmore-data.png) --- class: center, middle # Our plan ![](../imgs/all-data.png) --- class: center, middle # Data brainstorming <span style="font-size:52px">https://goo.gl/9j3T7y<span> --- class: center, middle # What are data? http://en.wikipedia.org/wiki/Data <span style="font-size:52px">Data are values of qualitative or quantitative variables, belonging to a set of items.<span> --- class: center, middle # Relativity of raw data https://simplystatistics.org/2016/07/20/relativity-raw-data/ <span style="font-size:40px"> ...raw data is raw to you if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different "raw data set"..<span> --- class: center, middle # The relativity of raw data - example ![](../imgs/hiseq.png) --- class: center, middle # The relativity of raw data - example ![](../imgs/hiseq-workflow.png) --- class: center, middle # How to share data ![](../imgs/ellis-datashare.png) --- # The four parts 1. The raw data. 2. A tidy data set 3. A code book describing each variable and its values in the tidy data set. 4. An explicit and exact recipe you used to go from 1 -> 2,3 --- class: center, middle # Raw data ![](../imgs/raw-data.png) --- class: center, middle # Tidy data ![](../imgs/tidy-data.png) --- class: center, middle # Code book ![](../imgs/code-book.png) --- class: center, middle # Recipe ![](../imgs/recipe-best.png) --- class: center, middle # Recipe ![](../imgs/recipe-meh.png) --- class: center, middle # Recipe ![](../imgs/recipe-no.png) --- class: center, middle <span style="font-size:52px">Getting data<span> --- # Relative versus absolute paths .pull-left[ __Do__ ```r setwd("../data") setwd("./files") setwd("..\tmp") ``` ] .pull-right[ __Don't__ ```r setwd("/Users/jtleek/data") setwd("~/Desktop/files/data") setwd("C:\\Users\\Andrew\\Downloads") ``` ] --- # Finding and creating files ```r if(!file.exists("data")){ dir.create("data") } list.files("data") ``` --- # Downloading data ```r file_url <- paste0("https://data.baltimorecity.gov/api/", "views/dz54-2aru/rows.csv?accessType=DOWNLOAD") download.file(file_url, destfile="cameras.csv") list.files(".") date_downloaded <- date() date_downloaded ``` --- class: inverse, center, middle # Google Sheets --- class: inverse, center, middle # Google Sheets --- background-image: url(../imgs/getting_data/gsheet.png) background-size: 80% background-position: center # https://docs.google.com/spreadsheets --- background-image: url(../imgs/getting_data/gsheets_jenny.png) background-size: 80% background-position: bottom # <p style='font-size:20pt'>https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015</p> --- class: inverse, middle <div style="font-size:28pt"> <code class="r">install.packages("googlesheets")</code> <code class="r">library(googlesheets)</code> <br> <code class="r">?gs_read</code> <br> <code class="r">?"cell-specification"</code> </div> --- background-image: url(../imgs/getting_data/sheet1.png) background-size: 100% background-position: center # <p style='font-size:14pt'>https://docs.google.com/spreadsheets/d/18KQQd4LY5k8Ucux1MvWCsQGQJlvd0ECTnn-3ixdOKFM/pubhtml</p> --- background-image: url(../imgs/getting_data/sheet2.png) background-size: 100% background-position: center # Publish to the web --- class: inverse, middle <div style="font-size:24pt"> <code class="r">sheets_url = "https://docs.google.com/spreadsheets</code> <code class="r">/d/18KQQd4LY5k8Ucux1MvWCsQGQJlvd0ECTnn-3ixdOKFM/pubhtml"</code> <br> <br> <code class="r">gsurl1 = gs_url(sheets_url)</code> <br> <code class="r">dat = gs_read(gsurl1)</code> </div> --- class: inverse, middle, center # JSON --- background-image: url(../imgs/getting_data/json.png) background-size: 100% background-position: center # https://en.wikipedia.org/wiki/JSON --- background-image: url(../imgs/getting_data/json.png) background-size: 100% background-position: center # Why JSON matters <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> <p style='font-size:14pt'>https://developer.github.com/v3/search/</p> --- # Reading in JSON: jsonlite ```r github_url = "https://api.github.com/users/jtleek/repos" install.packages("jsonlite") library(jsonlite) jsonData <- fromJSON(github_url) dim(jsonData) jsonData$name ``` --- # Data frame structure from JSON ```r table(sapply(jsonData,class)) dim(jsonData$owner) names(jsonData$owner) ``` --- class: inverse, middle, center # Web Scraping --- background-image: url(../imgs/getting_data/this_is_data.png) background-size: 80% background-position: bottom # This is data<p style='font-size:20pt'>.right[http://bowtie-bio.sourceforge.net/recount/]</p> --- background-image: url(../imgs/getting_data/view_source.png) background-size: 80% background-position: center # View Source --- background-image: url(../imgs/getting_data/computer_sees.png) background-size: 80% background-position: center # What the computer sees --- background-image: url(../imgs/getting_data/inspect_element.png) background-size: 80% background-position: center # Inspect element --- background-image: url(../imgs/getting_data/copy_xpath.png) background-size: 80% background-position: center # Copy XPath --- background-image: url(../imgs/getting_data/selector_gadget.png) background-size: 80% background-position: center # Selector Gadget --- background-image: url(../imgs/getting_data/run_selector_gadget.png) background-size: 80% background-position: center # Selector Gadget --- ## `rvest` package ```r recount_url = "http://bowtie-bio.sourceforge.net/recount/" install.packages("rvest") library(rvest) htmlfile = read_html(recount_url) nds = html_nodes(htmlfile, xpath='//*[@id="recounttab"]/table') dat = html_table(nds) dat = as.data.frame(dat) head(dat) ``` --- background-image: url(../imgs/getting_data/okcupid.png) background-size: 80% background-position: center # <p style='font-size:20pt'>http://motherboard.vice.com/read/70000-okcupid-users-just-had-their-data-published</p> --- background-image: url(../imgs/getting_data/guardian.png) background-size: 80% background-position: center # <p style='font-size:20pt'>https://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden</p> --- class: inverse, center, middle # APIs --- background-image: url(../imgs/getting_data/apis.png) background-size: 80% background-position: bottom # Application Programming Interfaces<p style='font-size:20pt'>.right[http://bowtie-bio.sourceforge.net/recount/]</p> --- background-image: url(../imgs/getting_data/pubmed.png) background-size: 100% background-position: bottom # In biology too!<p style='font-size:18pt'>.right[http://www.ncbi.nlm.nih.gov/books/NBK25501/]</p> --- background-image: url(../imgs/getting_data/step0.png) background-size: 80% background-position: bottom # Step 0: Did someone do this already<p style='font-size:20pt'>.right[https://ropensci.org/]</p> --- background-image: url(../imgs/getting_data/figshare.png) background-size: 80% background-position: bottom # Figshare<p style='font-size:20pt'>.right[https://figshare.com]</p> --- ## Figshare API wrapper ```r install.packages("rfigshare") library(rfigshare) leeksearch = fs_search("Leek") length(leeksearch) leeksearch[[1]] ``` --- background-image: url(../imgs/getting_data/diy.png) background-size: 100% background-position: bottom # Do it yourself --- background-image: url(../imgs/getting_data/read_docs.png) background-size: 100% background-position: bottom # Read the docs --- background-image: url(../imgs/getting_data/api_limit.png) background-size: 100% background-position: center # Read the docs --- background-image: url(../imgs/getting_data/example_query.png) background-size: 100% background-position: center # Example query --- class: middle # A dissected example <p style='font-size:30pt'> <font color="lightgray">https://api.github.com/search/repositories?q=created:2014-08-13+language:r+-user:cran&type </font> </p> --- class: middle # Base URL <p style='font-size:30pt'> https://api.github.com/<font color="lightgray">search/repositories?q=created:2014-08-13+language:r+-user:cran&type</font> </p> --- class: middle # The Path: Search repositories <p style='font-size:30pt'> <font color="lightgray">https://api.github.com/</font>search/repositories<font color="lightgray">?q=created:2014-08-13+language:r+-user:cran&type</font> </p> --- class: middle # Create a query <p style='font-size:30pt'> <font color="lightgray">https://api.github.com/search/repositories</font>?q=created:2014-08-13+language:r+-user:cran&type </p> --- class: middle # Date repo was created <p style='font-size:30pt'> <font color="lightgray">https://api.github.com/search/repositories?q=</font>created:2014-08-13<font color="lightgray">+language:r+-user:cran&type</font> </p> --- class: middle # Language repo is in <p style='font-size:30pt'> <font color="lightgray">https://api.github.com/search/repositories?q=created:2014-08-13</font>+language:r<font color="lightgray">+-user:cran&type</font> </p> --- class: middle # Ignore repos from CRAN <p style='font-size:30pt'> <font color="lightgray">https://api.github.com/search/repositories?q=created:2014-08-13+language:r</font>+-user:cran&type </p> --- ## Using httr to call the API ```r install.packages("httr") library(httr) query_url = paste0("https://api.github.com/search/", "repositories?q=created:2014-08-13+language:r+-user:cran") req = GET(query_url) names(content(req)) ``` --- background-image: url(../imgs/getting_data/non_open_apis.png) background-size: 100% background-position: bottom # Not all APIs are "open" --- # Authentication! ```r myapp = oauth_app( "twitter", key="yourConsumerKeyHere", secret="yourConsumerSecretHere") sig = sign_oauth1.0( myapp, token = "yourTokenHere", token_secret = "yourTokenSecretHere") homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig) ``` --- ## But you can get cool data ``` json1 = content(homeTL) json2 = jsonlite::fromJSON(toJSON(json1)) json2[1,1:4] created_at id id_str 1 Mon Jan 13 05:18:04 +0000 2014 4.225984e+17 422598398940684288 text 1 Now that P. Norvig's regex golf IPython notebook hit Slashdot, let's see if our traffic spike tops the previous one: http://t.co/Vc6JhZXOo8 ``` --- ## Summary * `httr` to interact with the web (VERBS like GET/PUT) * `rvest` to grab all the exact elements you want - Check out selector gadget * `googlesheets` for ... Google Sheets * `googledrive` (http://googledrive.tidyverse.org/) just came out * `jsonlite` for JSON