In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
MOST IMPORTANT RULE - LOOK AT YOUR DATA!
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
MOST IMPORTANT RULE - LOOK AT YOUR DATA!
is.na - is TRUE if the data is FALSE otherwise! - negation (NOT)
is.na(x) is TRUE, then !is.na(x) is FALSEall takes in a logical and will be TRUE if ALL are TRUE
all(!is.na(x)) - are all values of x NOT NAany will be TRUE if ANY are true
any(is.na(x)) - do we have any NA's in x?complete.cases - returns TRUE if EVERY value of a row is NOT NA
FALSE missing one value (even if not important)One of the most important aspects of data cleaning is missing values.
Types of "missing" data:
NA - general missing dataNaN - stands for "Not a Number", happens when you do 0/0.Inf and -Inf - Infinity, happens when you take a positive number (or negative number) by 0.Each missing data type has a function that returns TRUE if the data is missing:
NA - is.naNaN - is.nanInf and -Inf - is.infiniteis.finite returns FALSE for all missing data and TRUE for non-missingOne important aspect (esp with subsetting) is that logical operations return NA for NA values. Think about it, the data could be > 2 or not we don't know, so R says there is no TRUE or FALSE, so that is missing:
x = c(0, NA, 2, 3, 4) x > 2
[1] FALSE NA FALSE TRUE TRUE
What to do? What if we want if x > 2 and x isn't NA?
Don't do x != NA, do x > 2 and x is NOT NA:
x != NA
[1] NA NA NA NA NA
x > 2 & !is.na(x)
[1] FALSE FALSE FALSE TRUE TRUE
What about seeing if a value is equal to multiple values? You can do (x == 1 | x == 2) & !is.na(x), but that is not efficient.
(x == 0 | x == 2) # has NA
[1] TRUE NA TRUE FALSE FALSE
(x == 0 | x == 2) & !is.na(x) # No NA
[1] TRUE FALSE TRUE FALSE FALSE
what to do?
%in%NEVER has NA, even if you put it there (BUT DON'T DO THIS):
x %in% c(0, 2, NA) # NEVER has NA and returns logical
[1] TRUE TRUE TRUE FALSE FALSE
x %in% c(0, 2) | is.na(x)
[1] TRUE TRUE TRUE FALSE FALSE
Similarly with logicals, operations/arithmetic with NA will result in NAs:
x + 2
[1] 2 NA 4 5 6
x * 2
[1] 0 NA 4 6 8
unique - gives you the unique values of a variabletable(x) - will give a one-way table of x
table(x, useNA = "ifany") - will have row NAtable(x, y) - will give a cross-tab of x and yHere we will use table to make tabulations of the data. Look at ?table to see options for missing data.
unique(x)
[1] 0 NA 2 3 4
table(x)
x 0 2 3 4 1 1 1 1
table(x, useNA = "ifany") # will not
x 0 2 3 4 <NA> 1 1 1 1 1
useNA = "ifany" will not have NA in table heading if no NA:
table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
useNA = "ifany")
0 1 2 3 1 1 4 4
You can set useNA = "always" to have it always have a column for NA
table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
useNA = "always")
0 1 2 3 <NA> 1 1 4 4 0
If you use a factor, all levels will be given even if no exist! - (May be wanted or not):
fac = factor(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
levels = 1:4)
tab = table(fac)
tab
fac 1 2 3 4 1 4 4 0
tab[ tab > 0 ]
fac 1 2 3 1 4 4
A two-way table. If you pass in 2 vectors, table creates a 2-dimensional table.
tab <- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3),
useNA = "always")
tab
0 1 2 3 4 <NA>
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 2 0 2 0
3 0 0 0 4 0 0
<NA> 0 0 0 0 0 0
margin.table finds the marginal sums of the table. margin is 1 for rows, 2 for columns in general in R. Here is the column sums of the table:
margin.table(tab, 2)
0 1 2 3 4 <NA> 1 1 2 4 2 0
prop.table finds the marginal proportions of the table. Think of it dividing the table by it's respective marginal totals. If margin not set, divides by overall total.
prop.table(tab)
0 1 2 3 4 <NA>
0 0.1 0.0 0.0 0.0 0.0 0.0
1 0.0 0.1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.2 0.0 0.2 0.0
3 0.0 0.0 0.0 0.4 0.0 0.0
<NA> 0.0 0.0 0.0 0.0 0.0 0.0
prop.table(tab,1) * 100
0 1 2 3 4 <NA>
0 100 0 0 0 0 0
1 0 100 0 0 0 0
2 0 0 50 0 50 0
3 0 0 0 100 0 0
<NA>
From https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53 https://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv
Read the CSV into R Sal:
Sal = read_csv("http://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv")
Sal = rename(Sal, Name = name)
any() - checks if there are any TRUEsall() - checks if ALL are truehead(Sal,2)
Name JobTitle AgencyID
1 Aaron,Patricia G Facilities/Office Services II A03031
2 Aaron,Petra L ASSISTANT STATE'S ATTORNEY A29045
Agency HireDate AnnualSalary GrossPay
1 OED-Employment Dev (031) 10/24/1979 $55314.00 $53626.04
2 States Attorneys Office (045) 09/25/2006 $74000.00 $73000.08
any(is.na(Sal$Name)) # are there any NAs?
[1] FALSE
For example, let's say gender was coded as Male, M, m, Female, F, f. Using Excel to find all of these would be a matter of filtering and changing all by hand or using if statements.
In R, you can simply do something like:
data$gender[data$gender %in%
c("Male", "M", "m")] <- "Male"
or use ifelse
data %>%
mutate(gender = ifelse(gender %in% c("Male", "M", "m"),
"Male", gender))
Sometimes though, it's not so simple. That's where functions that find patterns come in very useful.
table(gender)
gender
F FeMAle FEMALE Fm M Ma mAle Male MaLe MALE
75 82 74 89 89 79 87 89 88 95
Man Woman
73 80
paste and paste0Paste can be very useful for joining vectors together:
paste("Visit", 1:5, sep = "_")
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
paste("Visit", 1:5, sep = "_", collapse = " ")
[1] "Visit_1 Visit_2 Visit_3 Visit_4 Visit_5"
paste("To", "is going be the ", "we go to the store!", sep = "day ")
[1] "Today is going be the day we go to the store!"
# and paste0 can be even simpler see ?paste0
paste0("Visit",1:5)
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
paste(1:5)
[1] "1" "2" "3" "4" "5"
paste(1:5, collapse = " ")
[1] "1 2 3 4 5"
Useful String functions
toupper(), tolower() - uppercase or lowercase your data:str_trim() (in the stringr package) or trimws in base
nchar - get the number of characters in a stringstringr packageLike dplyr, the stringr package:
data.frame in dplyr. matches any single character* means repeat as many (even if 0) more times the last character? makes the last thing optional^ matches start of vector ^a - starts with "a"$ matches end of vector b$ - ends with "b"Very similar:
Base R
substr(x, start, stop) - substrings from position start to position stopstrsplit(x, split) - splits strings up - returns list!stringr
str_sub(x, start, end) - substrings from position start to position endstr_split(string, pattern) - splits strings up - returns list!In base R, strsplit splits a vector on a string into a list
x <- c("I really", "like writing", "R code programs")
y <- strsplit(x, split = " ") # returns a list
y
[[1]] [1] "I" "really" [[2]] [1] "like" "writing" [[3]] [1] "R" "code" "programs"
stringrstringr::str_split does the same thing:
library(stringr) y2 <- str_split(x, " ") # returns a list y2
[[1]] [1] "I" "really" [[2]] [1] "like" "writing" [[3]] [1] "R" "code" "programs"
One example case is when you want to split on a period ".". In regular expressions . means ANY character, so
str_split("I.like.strings", ".")
[[1]] [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed("."))
[[1]] [1] "I" "like" "strings"
ysuppressPackageStartupMessages(library(dplyr)) # must be loaded AFTER plyr y[[2]]
[1] "like" "writing"
sapply(y, dplyr::first) # on the fly
[1] "I" "like" "R"
sapply(y, nth, 2) # on the fly
[1] "really" "writing" "code"
sapply(y, last) # on the fly
[1] "really" "writing" "programs"
stringrstr_detect, str_subset, str_replace, and str_replace_all search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
str_detect - returns TRUE if pattern is foundstr_subset - returns only the strings which pattern were detected
x[str_detect(x, pattern)]str_extract - returns only strings which pattern were detected, but ONLY the patternstr_replace - replaces pattern with replacement the first timestr_replace_all - replaces pattern with replacement as many times matchedstringr?modifiers
fixed - match everything exactlyregexp - default - uses regular expressionsignore_case is an option to not have to use tolowerstringr and dplyrstr_subset(Sal$Name, "Rawlings")
[1] "Rawlings,Kellye A" "Rawlings,Paula M" [3] "Rawlings-Blake,Stephanie C"
Sal %>% filter(str_detect(Name, "Rawlings"))
Name JobTitle AgencyID
1 Rawlings,Kellye A EMERGENCY DISPATCHER A40302
2 Rawlings,Paula M COMMUNITY AIDE A04015
3 Rawlings-Blake,Stephanie C MAYOR A01001
Agency HireDate AnnualSalary GrossPay
1 M-R Info Technology (302) 01/06/2003 $48940.00 $73356.42
2 R&P-Recreation (015) 12/10/2007 $19802.00 $10443.70
3 Mayors Office (001) 12/07/1995 $167449.00 $165249.86
stringrWe can do the same thing (with 2 piping operations!) in dplyr
dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate(
AnnualSalary = AnnualSalary %>%
str_replace(fixed("$"), "") %>%
as.numeric) %>%
arrange(desc(AnnualSalary))
str_extract and str_extract_allstr_extract_all extracts all the matched strings - \\d searches for DIGITS/numbers
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "0" "3" "0" "3" "1" [[2]] [1] "2" "9" "0" "4" "5"
sort - reorders the data - characters work, but notrank - gives the rank of the data - ties are splitorder - gives the indices, if subset, would give the data sorted
x[order(x)] is the same as sortingsort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
x = rnorm(10) x[1] = x[2] # create a tie rank(x)
[1] 3.5 3.5 1.0 8.0 5.0 7.0 6.0 9.0 2.0 10.0
grep: grep, grepl, regexpr and gregexpr search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
grep(pattern, x, fixed=FALSE), where:
pattern = character string containing a regular expression to be matched in the given character vector.
x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find)
Base R does not use these functions. Here is a "translator" of the stringr function to base R functions
str_detect - similar to grepl (return logical)grep(value = FALSE) is similar to which(str_detect())str_subset - similar to grep(value = TRUE) - return value of matchedstr_replace - similar to sub - replace one timestr_replace_all - similar to gsub - replace many timesBase R:
(pattern, x)(fixed = TRUE)stringr
(string, pattern) aka (x, pattern)fixed(pattern)These are the indices where the pattern match occurs:
grep("Rawlings",Sal$Name)
[1] 10256 10257 10258
which(grepl("Rawlings", Sal$Name))
[1] 10256 10257 10258
which(str_detect(Sal$Name, "Rawlings"))
[1] 10256 10257 10258
These are the indices where the pattern match occurs:
head(grepl("Rawlings",Sal$Name))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(str_detect(Sal$Name, "Rawlings"))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
grep("Rawlings",Sal$Name,value=TRUE)
[1] "Rawlings,Kellye A" "Rawlings,Paula M" [3] "Rawlings-Blake,Stephanie C"
Sal[grep("Rawlings",Sal$Name),]
Name JobTitle AgencyID
10256 Rawlings,Kellye A EMERGENCY DISPATCHER A40302
10257 Rawlings,Paula M COMMUNITY AIDE A04015
10258 Rawlings-Blake,Stephanie C MAYOR A01001
Agency HireDate AnnualSalary GrossPay
10256 M-R Info Technology (302) 01/06/2003 $48940.00 $73356.42
10257 R&P-Recreation (015) 12/10/2007 $19802.00 $10443.70
10258 Mayors Office (001) 12/07/1995 $167449.00 $165249.86
str_extractstr_extract extracts just the matched string
ss = str_extract(Sal$Name, "Rawling") head(ss)
[1] NA NA NA NA NA NA
ss[ !is.na(ss)]
[1] "Rawling" "Rawling" "Rawling"
str_extract and str_extract_allstr_extract_all extracts all the matched strings
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "0" "3" "0" "3" "1" [[2]] [1] "2" "9" "0" "4" "5"
head(grep("^Payne.*", x = Sal$Name, value = TRUE), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(grep("Leonard.?S", x = Sal$Name, value = TRUE))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(grep("Spence.*C.*", x = Sal$Name, value = TRUE))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
stringrhead(str_subset( Sal$Name, "^Payne.*"), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(str_subset( Sal$Name, "Leonard.?S"))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(str_subset( Sal$Name, "Spence.*C.*"))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
Let's say we wanted to sort the data set by Annual Salary:
class(Sal$AnnualSalary)
[1] "factor"
sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
So we must change the annual pay into a numeric:
head(Sal$AnnualSalary, 4)
[1] $55314.00 $74000.00 $64500.00 $46309.00 1654 Levels: $10000.00 $100000.00 $100013.00 $100200.00 ... $99994.00
head(as.numeric(Sal$AnnualSalary), 4)
[1] 908 1302 1094 722
R didn't like the $ so it thought turned them all to NA.
sub() and gsub() can do the replacing part in base R.
Now we can replace the $ with nothing (used fixed=TRUE because $ means ending):
Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="",
Sal$AnnualSalary, fixed=TRUE))
Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ]
Sal[1:5, c("Name", "AnnualSalary", "JobTitle")]
Name AnnualSalary JobTitle 8983 Mosby,Marilyn J 238772 STATE'S ATTORNEY 732 Batts,Anthony W 211785 Police Commissioner 13223 Wen,Leana 200000 Executive Director III 10274 Raymond,Henry J 192500 Executive Director III 12118 Swift,Michael 187200 CONTRACT SERV SPEC II
stringrWe can do the same thing (with 2 piping operations!) in dplyr
dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate(
AnnualSalary = AnnualSalary %>%
str_replace(
fixed("$"),
"") %>%
as.numeric) %>%
arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)
[1] TRUE