Reputation: 21

How to clean and standardize words using R

I have some data like:

data<-data.frame(comment=c('scan','scanned','SCANNED','scan and sent','FAXED','faxed to','faxed- pt'))

        comment
1          scan
2       scanned
3       SCANNED
4 scan and sent
5         FAXED
6      faxed to
7     faxed- pt

I'm wondering how to use R to clean the data into:

1  scanned
2  scanned
3  scanned
4  scanned
5    faxed
6    faxed
7    faxed

Thanks!!

Upvotes: 1

Answers (3)

Tyler Rinker

Reputation: 109874

Here's an approximate-matching approach using agrepl, shown with both data.table and dplyr. It's not much different from the other solutions here, but potentially less code:

comment <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')

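library(data.table)
# ignore.case = TRUE so that upper-case entries such as 'FAXED' also match "fax";
# anything that does not match "fax" is labeled "scanned"
data.table(comment)[, cleaned := ifelse(agrepl("fax", comment, ignore.case = TRUE), "faxed", "scanned")][]

library(dplyr)
# tibble() replaces the deprecated data_frame()
tibble(comment) %>%
    mutate(cleaned = ifelse(agrepl("fax", comment, ignore.case = TRUE), "faxed", "scanned"))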

Upvotes: 1

Alex Woolford

Reputation: 4563

You might want to check out the stringdist package, e.g.:

library(stringdist)

toMatch <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
possibleValues <- c("scanned", "faxed")

possibleValues[amatch(x = toMatch, table = possibleValues, maxDist = Inf)]

Returns:

[1] "scanned" "scanned" "scanned" "scanned" "faxed"   "faxed"   "faxed"
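If installing a package is not an option, the same nearest-candidate idea can be sketched in base R with adist, which computes a generalized Levenshtein distance. This is only an illustration and not a drop-in replacement for amatch: the distance method differs from stringdist's default, and the variable names d and nearest are made up for the example. Setting ignore.case = TRUE means 'SCANNED' is compared as if it were lower case rather than counted as seven substitutions.

```r
toMatch <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
possibleValues <- c("scanned", "faxed")

# Distance matrix: one row per input string, one column per candidate value.
d <- adist(toMatch, possibleValues, ignore.case = TRUE)

# For each row, pick the candidate with the smallest edit distance
# (which.min returns the first column on ties).
nearest <- possibleValues[apply(d, 1, which.min)]
nearest
```

As with maxDist = Inf above, every input is forced onto one of the candidates, so unrelated comments would still be assigned a label; inspecting the rows of d is one way to spot poor matches.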

Upvotes: 5

maccruiskeen

Reputation: 2818

This is an easy way to do it, but how well it works depends on how dirty the rest of the data is. An entry that mentions both scan and fax would be labeled scanned only, so this approach can't separate those cases.

data<-data.frame(comment=c('scan','scanned','SCANNED','scan and sent','FAXED','faxed to','faxed- pt'))
data$cleaned <- tolower(data$comment)
data$cleaned <- ifelse(grepl("scan", data$cleaned), "scanned", data$cleaned)
data$cleaned <- ifelse(grepl("fax", data$cleaned), "faxed", data$cleaned)

This leaves you with:

R> data
        comment cleaned
1          scan scanned
2       scanned scanned
3       SCANNED scanned
4 scan and sent scanned
5         FAXED   faxed
6      faxed to   faxed
7     faxed- pt   faxed
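If more categories turn up, the chained ifelse calls could be folded into a small lookup. The following is a rough sketch in base R, assuming a hypothetical named vector patterns (substring to standardized label) and a hypothetical helper standardize; neither comes from the answer above. The first matching pattern wins, and entries that match nothing are left as NA so they are easy to find.

```r
# Hypothetical lookup: substring to search for -> standardized label.
patterns <- c(scan = "scanned", fax = "faxed")

standardize <- function(x, patterns) {
  x_lower <- tolower(x)
  out <- rep(NA_character_, length(x))  # unmatched entries stay NA
  for (p in names(patterns)) {
    # Only fill entries that no earlier pattern has already claimed.
    hit <- is.na(out) & grepl(p, x_lower, fixed = TRUE)
    out[hit] <- patterns[[p]]
  }
  out
}

comment <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
cleaned <- standardize(comment, patterns)
cleaned
```

Because patterns are tried in order, an entry containing both scan and fax still ends up as scanned here; reordering the vector changes which label takes priority.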

Upvotes: 0
