In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
MOST IMPORTANT RULE - LOOK AT YOUR DATA!
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
MOST IMPORTANT RULE - LOOK AT YOUR DATA!
is.na
- is TRUE
if the data is FALSE
otherwise!
- negation (NOT)
is.na(x)
is TRUE
, then !is.na(x)
is FALSE
all
takes in a logical
and will be TRUE
if ALL are TRUE
all(!is.na(x))
- are all values of x
NOT NAany
will be TRUE
if ANY are true
any(is.na(x))
- do we have any NA
's in x
?complete.cases
- returns TRUE
if EVERY value of a row is NOT NA
FALSE
missing one value (even if not important)One of the most important aspects of data cleaning is missing values.
Types of "missing" data:
NA
- general missing dataNaN
- stands for "Not a Number", happens when you do 0/0.Inf
and -Inf
- Infinity, happens when you take a positive number (or negative number) by 0.Each missing data type has a function that returns TRUE
if the data is missing:
NA
- is.na
NaN
- is.nan
Inf
and -Inf
- is.infinite
is.finite
returns FALSE
for all missing data and TRUE
for non-missingOne important aspect (esp with subsetting) is that logical operations return NA
for NA
values. Think about it, the data could be > 2
or not we don't know, so R
says there is no TRUE
or FALSE
, so that is missing:
x = c(0, NA, 2, 3, 4) x > 2
[1] FALSE NA FALSE TRUE TRUE
What to do? What if we want if x > 2
and x
isn't NA
?
Don't do x != NA
, do x > 2
and x
is NOT NA
:
x != NA
[1] NA NA NA NA NA
x > 2 & !is.na(x)
[1] FALSE FALSE FALSE TRUE TRUE
What about seeing if a value is equal to multiple values? You can do (x == 1 | x == 2) & !is.na(x)
, but that is not efficient.
(x == 0 | x == 2) # has NA
[1] TRUE NA TRUE FALSE FALSE
(x == 0 | x == 2) & !is.na(x) # No NA
[1] TRUE FALSE TRUE FALSE FALSE
what to do?
%in%
NEVER has NA, even if you put it there (BUT DON'T DO THIS):
x %in% c(0, 2, NA) # NEVER has NA and returns logical
[1] TRUE TRUE TRUE FALSE FALSE
x %in% c(0, 2) | is.na(x)
[1] TRUE TRUE TRUE FALSE FALSE
Similarly with logicals, operations/arithmetic with NA
will result in NA
s:
x + 2
[1] 2 NA 4 5 6
x * 2
[1] 0 NA 4 6 8
unique
- gives you the unique values of a variabletable(x)
- will give a one-way table of x
table(x, useNA = "ifany")
- will have row NAtable(x, y)
- will give a cross-tab of x
and y
Here we will use table
to make tabulations of the data. Look at ?table
to see options for missing data.
unique(x)
[1] 0 NA 2 3 4
table(x)
x 0 2 3 4 1 1 1 1
table(x, useNA = "ifany") # will not
x 0 2 3 4 <NA> 1 1 1 1 1
useNA = "ifany"
will not have NA in table heading if no NA
:
table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), useNA = "ifany")
0 1 2 3 1 1 4 4
You can set useNA = "always"
to have it always have a column for NA
table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), useNA = "always")
0 1 2 3 <NA> 1 1 4 4 0
If you use a factor
, all levels will be given even if no exist! - (May be wanted or not):
fac = factor(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), levels = 1:4) tab = table(fac) tab
fac 1 2 3 4 1 4 4 0
tab[ tab > 0 ]
fac 1 2 3 1 4 4
A two-way table. If you pass in 2 vectors, table
creates a 2-dimensional table.
tab <- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3), useNA = "always") tab
0 1 2 3 4 <NA> 0 1 0 0 0 0 0 1 0 1 0 0 0 0 2 0 0 2 0 2 0 3 0 0 0 4 0 0 <NA> 0 0 0 0 0 0
margin.table
finds the marginal sums of the table. margin
is 1 for rows, 2 for columns in general in R
. Here is the column sums of the table:
margin.table(tab, 2)
0 1 2 3 4 <NA> 1 1 2 4 2 0
prop.table
finds the marginal proportions of the table. Think of it dividing the table by it's respective marginal totals. If margin
not set, divides by overall total.
prop.table(tab)
0 1 2 3 4 <NA> 0 0.1 0.0 0.0 0.0 0.0 0.0 1 0.0 0.1 0.0 0.0 0.0 0.0 2 0.0 0.0 0.2 0.0 0.2 0.0 3 0.0 0.0 0.0 0.4 0.0 0.0 <NA> 0.0 0.0 0.0 0.0 0.0 0.0
prop.table(tab,1) * 100
0 1 2 3 4 <NA> 0 100 0 0 0 0 0 1 0 100 0 0 0 0 2 0 0 50 0 50 0 3 0 0 0 100 0 0 <NA>
From https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53 https://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv
Read the CSV into R Sal
:
Sal = read_csv("http://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv") Sal = rename(Sal, Name = name)
any()
- checks if there are any TRUE
sall()
- checks if ALL are truehead(Sal,2)
Name JobTitle AgencyID 1 Aaron,Patricia G Facilities/Office Services II A03031 2 Aaron,Petra L ASSISTANT STATE'S ATTORNEY A29045 Agency HireDate AnnualSalary GrossPay 1 OED-Employment Dev (031) 10/24/1979 $55314.00 $53626.04 2 States Attorneys Office (045) 09/25/2006 $74000.00 $73000.08
any(is.na(Sal$Name)) # are there any NAs?
[1] FALSE
For example, let's say gender was coded as Male, M, m, Female, F, f. Using Excel to find all of these would be a matter of filtering and changing all by hand or using if statements.
In R, you can simply do something like:
data$gender[data$gender %in% c("Male", "M", "m")] <- "Male"
or use ifelse
data %>% mutate(gender = ifelse(gender %in% c("Male", "M", "m"), "Male", gender))
Sometimes though, it's not so simple. That's where functions that find patterns come in very useful.
table(gender)
gender F FeMAle FEMALE Fm M Ma mAle Male MaLe MALE 75 82 74 89 89 79 87 89 88 95 Man Woman 73 80
paste
and paste0
Paste can be very useful for joining vectors together:
paste("Visit", 1:5, sep = "_")
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
paste("Visit", 1:5, sep = "_", collapse = " ")
[1] "Visit_1 Visit_2 Visit_3 Visit_4 Visit_5"
paste("To", "is going be the ", "we go to the store!", sep = "day ")
[1] "Today is going be the day we go to the store!"
# and paste0 can be even simpler see ?paste0 paste0("Visit",1:5)
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
paste(1:5)
[1] "1" "2" "3" "4" "5"
paste(1:5, collapse = " ")
[1] "1 2 3 4 5"
Useful String functions
toupper()
, tolower()
- uppercase or lowercase your data:str_trim()
(in the stringr
package) or trimws
in base
nchar
- get the number of characters in a stringstringr
packageLike dplyr
, the stringr
package:
data.frame
in dplyr
.
matches any single character*
means repeat as many (even if 0) more times the last character?
makes the last thing optional^
matches start of vector ^a
- starts with "a"$
matches end of vector b$
- ends with "b"Very similar:
Base R
substr(x, start, stop)
- substrings from position start to position stopstrsplit(x, split)
- splits strings up - returns list!stringr
str_sub(x, start, end)
- substrings from position start to position endstr_split(string, pattern)
- splits strings up - returns list!In base R, strsplit
splits a vector on a string into a list
x <- c("I really", "like writing", "R code programs") y <- strsplit(x, split = " ") # returns a list y
[[1]] [1] "I" "really" [[2]] [1] "like" "writing" [[3]] [1] "R" "code" "programs"
stringr
stringr::str_split
does the same thing:
library(stringr) y2 <- str_split(x, " ") # returns a list y2
[[1]] [1] "I" "really" [[2]] [1] "like" "writing" [[3]] [1] "R" "code" "programs"
One example case is when you want to split on a period ".
". In regular expressions .
means ANY character, so
str_split("I.like.strings", ".")
[[1]] [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed("."))
[[1]] [1] "I" "like" "strings"
y
suppressPackageStartupMessages(library(dplyr)) # must be loaded AFTER plyr y[[2]]
[1] "like" "writing"
sapply(y, dplyr::first) # on the fly
[1] "I" "like" "R"
sapply(y, nth, 2) # on the fly
[1] "really" "writing" "code"
sapply(y, last) # on the fly
[1] "really" "writing" "programs"
stringr
str_detect
, str_subset
, str_replace
, and str_replace_all
search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
str_detect
- returns TRUE
if pattern
is foundstr_subset
- returns only the strings which pattern were detected
x[str_detect(x, pattern)]
str_extract
- returns only strings which pattern were detected, but ONLY the patternstr_replace
- replaces pattern
with replacement
the first timestr_replace_all
- replaces pattern
with replacement
as many times matchedstringr
?modifiers
fixed
- match everything exactlyregexp
- default - uses regular expressionsignore_case
is an option to not have to use tolower
stringr
and dplyr
str_subset(Sal$Name, "Rawlings")
[1] "Rawlings,Kellye A" "Rawlings,Paula M" [3] "Rawlings-Blake,Stephanie C"
Sal %>% filter(str_detect(Name, "Rawlings"))
Name JobTitle AgencyID 1 Rawlings,Kellye A EMERGENCY DISPATCHER A40302 2 Rawlings,Paula M COMMUNITY AIDE A04015 3 Rawlings-Blake,Stephanie C MAYOR A01001 Agency HireDate AnnualSalary GrossPay 1 M-R Info Technology (302) 01/06/2003 $48940.00 $73356.42 2 R&P-Recreation (015) 12/10/2007 $19802.00 $10443.70 3 Mayors Office (001) 12/07/1995 $167449.00 $165249.86
stringr
We can do the same thing (with 2 piping operations!) in dplyr
dplyr_sal = Sal dplyr_sal = dplyr_sal %>% mutate( AnnualSalary = AnnualSalary %>% str_replace(fixed("$"), "") %>% as.numeric) %>% arrange(desc(AnnualSalary))
str_extract
and str_extract_all
str_extract_all
extracts all the matched strings - \\d
searches for DIGITS/numbers
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "0" "3" "0" "3" "1" [[2]] [1] "2" "9" "0" "4" "5"
sort
- reorders the data - characters work, but notrank
- gives the rank of the data - ties are splitorder
- gives the indices, if subset, would give the data sorted
x[order(x)]
is the same as sortingsort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
x = rnorm(10) x[1] = x[2] # create a tie rank(x)
[1] 3.5 3.5 1.0 8.0 5.0 7.0 6.0 9.0 2.0 10.0
grep
: grep
, grepl
, regexpr
and gregexpr
search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
grep(pattern, x, fixed=FALSE)
, where:
pattern = character string containing a regular expression to be matched in the given character vector.
x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find)
Base R does not use these functions. Here is a "translator" of the stringr
function to base R functions
str_detect
- similar to grepl
(return logical)grep(value = FALSE)
is similar to which(str_detect())
str_subset
- similar to grep(value = TRUE)
- return value of matchedstr_replace
- similar to sub
- replace one timestr_replace_all
- similar to gsub
- replace many timesBase R:
(pattern, x)
(fixed = TRUE)
stringr
(string, pattern)
aka (x, pattern)
fixed(pattern)
These are the indices where the pattern match occurs:
grep("Rawlings",Sal$Name)
[1] 10256 10257 10258
which(grepl("Rawlings", Sal$Name))
[1] 10256 10257 10258
which(str_detect(Sal$Name, "Rawlings"))
[1] 10256 10257 10258
These are the indices where the pattern match occurs:
head(grepl("Rawlings",Sal$Name))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(str_detect(Sal$Name, "Rawlings"))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
grep("Rawlings",Sal$Name,value=TRUE)
[1] "Rawlings,Kellye A" "Rawlings,Paula M" [3] "Rawlings-Blake,Stephanie C"
Sal[grep("Rawlings",Sal$Name),]
Name JobTitle AgencyID 10256 Rawlings,Kellye A EMERGENCY DISPATCHER A40302 10257 Rawlings,Paula M COMMUNITY AIDE A04015 10258 Rawlings-Blake,Stephanie C MAYOR A01001 Agency HireDate AnnualSalary GrossPay 10256 M-R Info Technology (302) 01/06/2003 $48940.00 $73356.42 10257 R&P-Recreation (015) 12/10/2007 $19802.00 $10443.70 10258 Mayors Office (001) 12/07/1995 $167449.00 $165249.86
str_extract
str_extract
extracts just the matched string
ss = str_extract(Sal$Name, "Rawling") head(ss)
[1] NA NA NA NA NA NA
ss[ !is.na(ss)]
[1] "Rawling" "Rawling" "Rawling"
str_extract
and str_extract_all
str_extract_all
extracts all the matched strings
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "0" "3" "0" "3" "1" [[2]] [1] "2" "9" "0" "4" "5"
head(grep("^Payne.*", x = Sal$Name, value = TRUE), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(grep("Leonard.?S", x = Sal$Name, value = TRUE))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(grep("Spence.*C.*", x = Sal$Name, value = TRUE))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
stringr
head(str_subset( Sal$Name, "^Payne.*"), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(str_subset( Sal$Name, "Leonard.?S"))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(str_subset( Sal$Name, "Spence.*C.*"))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
Let's say we wanted to sort the data set by Annual Salary:
class(Sal$AnnualSalary)
[1] "factor"
sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
So we must change the annual pay into a numeric:
head(Sal$AnnualSalary, 4)
[1] $55314.00 $74000.00 $64500.00 $46309.00 1654 Levels: $10000.00 $100000.00 $100013.00 $100200.00 ... $99994.00
head(as.numeric(Sal$AnnualSalary), 4)
[1] 908 1302 1094 722
R didn't like the $
so it thought turned them all to NA
.
sub()
and gsub()
can do the replacing part in base R.
Now we can replace the $
with nothing (used fixed=TRUE
because $
means ending):
Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="", Sal$AnnualSalary, fixed=TRUE)) Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ] Sal[1:5, c("Name", "AnnualSalary", "JobTitle")]
Name AnnualSalary JobTitle 8983 Mosby,Marilyn J 238772 STATE'S ATTORNEY 732 Batts,Anthony W 211785 Police Commissioner 13223 Wen,Leana 200000 Executive Director III 10274 Raymond,Henry J 192500 Executive Director III 12118 Swift,Michael 187200 CONTRACT SERV SPEC II
stringr
We can do the same thing (with 2 piping operations!) in dplyr
dplyr_sal = Sal dplyr_sal = dplyr_sal %>% mutate( AnnualSalary = AnnualSalary %>% str_replace( fixed("$"), "") %>% as.numeric) %>% arrange(desc(AnnualSalary)) check_Sal = Sal rownames(check_Sal) = NULL all.equal(check_Sal, dplyr_sal)
[1] TRUE