YIFAN WANG

Reputation: 21

How to clean and standardize words using R

I have some data like:

data <- data.frame(comment = c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt'))

        comment
1          scan
2       scanned
3       SCANNED
4 scan and sent
5         FAXED
6      faxed to
7     faxed- pt

How can I use R to standardize the data to:

1  scanned
2  scanned
3  scanned
4  scanned
5    faxed
6    faxed
7    faxed

Thanks!!

Upvotes: 1

Views: 137

Answers (3)

Tyler Rinker

Reputation: 109874

Here's an approximate-matching approach using agrepl, shown with both data.table and dplyr. It isn't much different from the other solutions here, but it is potentially less code:

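comment <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')

library(data.table)
# ignore.case = TRUE so that 'FAXED' also matches "fax"; the trailing [] prints the result
data.table(comment)[, cleaned := ifelse(agrepl("fax", comment, ignore.case = TRUE), "faxed", "scanned")][]

library(dplyr)
# tibble() replaces the deprecated data_frame()
tibble(comment) %>%
    mutate(cleaned = ifelse(agrepl("fax", comment, ignore.case = TRUE), "faxed", "scanned"))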

Upvotes: 1

Alex Woolford

Reputation: 4563

You might want to check out the stringdist package, e.g.:

library(stringdist)

toMatch <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
possibleValues <- c("scanned", "faxed")

possibleValues[amatch(x = toMatch, table = possibleValues, maxDist = Inf)]

Returns:

[1] "scanned" "scanned" "scanned" "scanned" "faxed"   "faxed"   "faxed"
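
If installing stringdist is not an option, the same nearest-value idea can be sketched in base R with adist(), which computes a generalized Levenshtein distance. This is only a rough sketch: the helper name nearest_value is made up for illustration, and ignore.case = TRUE is an assumption that case should not matter for these comments.

```r
to_match <- c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt')
possible_values <- c("scanned", "faxed")

# Hypothetical helper (not part of any package): for each element of x,
# return the candidate with the smallest edit distance.
nearest_value <- function(x, candidates) {
  # adist() returns a length(x) by length(candidates) distance matrix
  d <- adist(x, candidates, ignore.case = TRUE)
  # which.min() picks the first candidate when distances are tied
  candidates[apply(d, 1, which.min)]
}

result <- nearest_value(to_match, possible_values)
result
```

Like amatch with maxDist = Inf, this always picks some candidate, so an unrelated comment would still be forced into one of the two values.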

Upvotes: 5

maccruiskeen

Reputation: 2818

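This is an easy way to do it, but whether it works depends on how dirty the rest of the data is. An entry that includes both scan and fax would not be handled sensibly: it would be labeled "scanned", because the first rule overwrites the value before the fax rule runs.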

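data <- data.frame(comment = c('scan', 'scanned', 'SCANNED', 'scan and sent', 'FAXED', 'faxed to', 'faxed- pt'))
# Lowercase first so the patterns below match regardless of case
data$cleaned <- tolower(data$comment)
# Collapse anything containing "scan" to "scanned", then anything containing "fax" to "faxed"
data$cleaned <- ifelse(grepl("scan", data$cleaned), "scanned", data$cleaned)
data$cleaned <- ifelse(grepl("fax", data$cleaned), "faxed", data$cleaned)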

This leaves you with:

R> data
        comment cleaned
1          scan scanned
2       scanned scanned
3       SCANNED scanned
4 scan and sent scanned
5         FAXED   faxed
6      faxed to   faxed
7     faxed- pt   faxed
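
If more categories turn up, the same keyword idea can be written once as a small rule table instead of one ifelse per keyword. The sketch below is only an illustration under a few assumptions: the function name standardize, the rules vector, and the extra 'mailed' entry are all made up here, the first matching rule wins, and anything that matches no rule is left as NA rather than kept as-is.

```r
# 'mailed' is an invented entry, added only to show what happens to unmatched values
comments <- c('scan', 'scanned', 'SCANNED', 'scan and sent',
              'FAXED', 'faxed to', 'faxed- pt', 'mailed')

# Hypothetical rule table: names are the cleaned labels, values are the patterns
rules <- c(scanned = "scan", faxed = "fax")

standardize <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (label in names(rules)) {
    # Only fill entries that no earlier rule has claimed (first match wins)
    hit <- is.na(out) & grepl(rules[[label]], x, ignore.case = TRUE)
    out[hit] <- label
  }
  out
}

cleaned <- standardize(comments, rules)
cleaned
```

Leaving unmatched entries as NA makes it easy to list them with comments[is.na(cleaned)] and decide whether a new rule is needed.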

Upvotes: 0
