Reputation: 21
I have some data like:
data<-data.frame(comment=c('scan','scanned','SCANNED','scan and sent','FAXED','faxed to','faxed- pt'))
1 scan
2 scanned
3 SCANNED
4 scan and sent
5 FAXED
6 faxed to
7 faxed- pt
I'm wondering how to use R to clean the data into:
1 scanned
2 scanned
3 scanned
4 scanned
5 faxed
6 faxed
7 faxed
Thanks!!
Upvotes: 1
Views: 137
Reputation: 109874
Here's an approximate-matching approach using agrepl, in both data.table and dplyr versions. It's not too different from the other solutions here, but potentially less code:
comment <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
library(data.table)
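# agrepl() is case-sensitive by default, so ignore.case = TRUE is needed to catch 'FAXED'
data.table(comment)[, cleaned := ifelse(agrepl("fax", comment, ignore.case = TRUE), "faxed", "scanned")][]
library(dplyr)
tibble(comment) %>%
  mutate(cleaned = ifelse(agrepl("fax", comment, ignore.case = TRUE), "faxed", "scanned"))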
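One thing worth checking with this approach is how agrepl treats case. A minimal base-R sketch (the four test strings are just a subset of the question's data, and I'm assuming the default max.distance of 0.1, which allows at most one edit for a three-letter pattern):

```r
x <- c("scan", "SCANNED", "FAXED", "faxed to")

# Default call: matching is case-sensitive, so "FAXED" is not picked up
default_match <- agrepl("fax", x)
default_match
# [1] FALSE FALSE FALSE  TRUE

# With ignore.case = TRUE the upper-case entry matches as well
nocase_match <- agrepl("fax", x, ignore.case = TRUE)
nocase_match
# [1] FALSE FALSE  TRUE  TRUE
```

So if the real data mixes upper and lower case, passing ignore.case = TRUE (or lower-casing the column first) is probably what you want.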
Upvotes: 1
Reputation: 4563
You might want to check out the stringdist package, e.g.:
library(stringdist)
toMatch <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
possibleValues <- c("scanned", "faxed")
possibleValues[amatch(x = toMatch, table = possibleValues, maxDist = Inf)]
Returns:
[1] "scanned" "scanned" "scanned" "scanned" "faxed" "faxed" "faxed"
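If installing an extra package isn't an option, the same "pick the closest candidate" idea can be sketched in base R with adist(). This is a stand-in rather than an equivalent: adist() computes generalized Levenshtein distance, whereas amatch defaults to the OSA metric, and the ignore.case = TRUE argument is my own addition so that upper-case entries compare cleanly.

```r
toMatch <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
possibleValues <- c("scanned", "faxed")

# Distance matrix: one row per entry in toMatch, one column per candidate
d <- adist(toMatch, possibleValues, ignore.case = TRUE)

# Take the nearest candidate for each row; which.min() keeps the first one on ties
cleaned <- possibleValues[apply(d, 1, which.min)]
cleaned
# [1] "scanned" "scanned" "scanned" "scanned" "faxed"   "faxed"   "faxed"
```

Like maxDist = Inf above, this always assigns some label, even to an entry that resembles neither candidate, so a distance cutoff may be worth adding for messier data.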
Upvotes: 5
Reputation: 2818
This is an easy way to do it, but it depends on how dirty the rest of the data is. If any entry included both scan and fax, this wouldn't work: the scan check runs first, so such an entry would simply be labeled scanned.
data<-data.frame(comment=c('scan','scanned','SCANNED','scan and sent','FAXED','faxed to','faxed- pt'))
data$cleaned <- tolower(data$comment)
data$cleaned <- ifelse(grepl("scan", data$cleaned), "scanned", data$cleaned)
data$cleaned <- ifelse(grepl("fax", data$cleaned), "faxed", data$cleaned)
This leaves you with:
R> data
        comment cleaned
1          scan scanned
2       scanned scanned
3       SCANNED scanned
4 scan and sent scanned
5         FAXED   faxed
6      faxed to   faxed
7     faxed- pt   faxed
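If entries mentioning both keywords are a real possibility, one option is to test for each keyword separately and flag the overlap instead of letting the first check win. A rough sketch in base R; the entries 'scan then fax' and 'mailed' and the "both" label are made up for illustration and are not in the original data:

```r
# 'scan then fax' and 'mailed' are hypothetical entries added to show the edge cases
comment <- c('scan', 'SCANNED', 'FAXED', 'faxed to', 'scan then fax', 'mailed')
lower <- tolower(comment)

# Check each keyword independently on the lower-cased text
has_scan <- grepl("scan", lower, fixed = TRUE)
has_fax  <- grepl("fax", lower, fixed = TRUE)

# Flag overlaps explicitly; leave entries with neither keyword as NA
cleaned <- ifelse(has_scan & has_fax, "both",
           ifelse(has_scan, "scanned",
           ifelse(has_fax, "faxed", NA_character_)))
cleaned
# [1] "scanned" "scanned" "faxed"   "faxed"   "both"    NA
```

Whether "both" and NA are the right labels depends on what you plan to do with those rows afterwards.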
Upvotes: 0