Reputation: 367
I am new to R. I am working on a very large dataset with 1M+ rows and 1000+ columns that needs serious data cleaning. I have managed to clean the data for most part, however that has left me with some messy repetitive code.
I am trying to use grep to find rows matching the invalid data and replace them with the correct values. However, when I OR multiple terms in my search criteria it doesn't work as expected. I will explain this with an example so it makes sense.
Here's a small data frame to show what’s going on.
The data frame has three columns, state, fatalities, event. The event column has the messy data that needs to be cleaned.
state <- c("AL", "CA", "AZ", "MA", "MN", "NY", "NJ", "GA", "TX", "MN")
fatalities <- c(2,1,4,6,7,9,1,10,1,7)
event <- c("THUNDERSTORM", "THUNERSTORM W","THUNDERSTROM","TSTM","HUNDERSTORM","THUNDERSTORM","THUNDERSTORM","THUNDERSTORM","THUNDERSTORM","THUNDERSTORM")
df <- data.frame(state,fatalities,event, stringsAsFactors = FALSE)
df$Corrected <- ''
df
state fatalities event Corrected
1 AL 2 THUNDERSTORM
2 CA 1 THUNERSTORM W
3 AZ 4 THUNDERSTROM
4 MA 6 TSTM
5 MN 7 HUNDERSTORM
6 NY 9 THUNDERSTORM
7 NJ 1 THUNDERSTORM
8 GA 10 THUNDERSTORM
9 TX 1 THUNDERSTORM
10 MN 7 THUNDERSTORM
As you can see, the event has things like ‘TSTM’, ‘THUNERSTORM’ etc that need to be corrected to 'THUNDERSTORM'.
If I simply say
grep("TSTM", df$event)
then it works fine as shown below.
df$Corrected[grep("TSTM", df$event)] <- "THUNDERSTORM"
df
state fatalities event Corrected
1 AL 2 THUNDERSTORM
2 CA 1 THUNERSTORM W
3 AZ 4 THUNDERSTROM
4 MA 6 TSTM THUNDERSTORM
...
However, the problem with this is if I have 100 variants, I will end up with 100 statements.
If I play with OR operator and try something like this, it does not work. My idea was to list all the variants in one place and then just replace all occurrences in one shot (or at least place the valid value in the Corrected column for the matching rows)
Now, if I try to OR multiple terms as shown below:
df$Corrected[grep("TSTM | THUNERSTORM", df$event)] <- "THUNDERSTORM"
df
as you can see, there is no change and the Corrected column remains blank as if it did not find any matches.
state fatalities event Corrected
1 AL 2 THUNDERSTORM
2 CA 1 THUNERSTORM W
3 AZ 4 THUNDERSTROM
4 MA 6 TSTM
5 MN 7 HUNDERSTORM
6 NY 9 THUNDERSTORM
7 NJ 1 THUNDERSTORM
8 GA 10 THUNDERSTORM
9 TX 1 THUNDERSTORM
10 MN 7 THUNDERSTORM
Why doesn't grep like my OR statement? Am I missing something obvious? I tried the suggestion from this thread but it did not work.
Secondly, is there a better approach, considering I will have hundreds of such cleanup statement for each category? Meaning, can I convert it into a function that takes two parameters, filter to be matched and the string replacement and change the data directly?
e.g.
CleanUp <- function(filter, replacement) df$Corrected[grep(filter, df$event)] <- replacement
And then I can keep calling it as:
CleanUp("TSTM", "THUNDERSTORM")
CleanUp("THUNERSTORM", "THUNDERSTORM")
...etc
I tried this and it did not work either. I couldn't figure out why it did not work. Things I have considered and could not get to work consistently : different regular expressions, egrep, dplyr contains. Lastly, sorry if it comes off as 'a way too long' post to ask something simple.
Upvotes: 0
Views: 122
Reputation: 145765
Comments to an answer. To your first question: Spaces matter in regex. "TSTM | THUNERSTORM" is searching for "TSTM " or " THUNERSTORM" with trailing/leading spaces. Try "TSTM|THUNERSTORM"
For functionalizing things, R functions don't usually modify outside objects. You could re-write your function
CleanUp <- function(df, filter, replacement) {
df$Corrected[grep(filter, df$event)] <- replacement
return(df)
}
Then call it as
df <- CleanUp(df, "TSTM", "THUNDERSTORM")
A nicer way might be to compile a vector of regex patterns and programmatically construct the grep
call:
patterns = c("ERSTORM", "TSTM", "THUNER")
df$Corrected = ifelse(grep(paste(patterns, collapse = "|"), df$event),
"THUNDERSTORM", df$Corrected)
Upvotes: 4