berinsj

Reputation: 13

Replace similar incorrectly spelled words

I am trying to fix my survey data. My data frame contains multiple values that should be the same; however, there are variations in spelling, spacing, and capitalization leading to more than the expected number of levels.

str(data.frame$race)
"American Indian and Alaska Native" 
"Asian"                             
"Black of African American"        
"Black or African American"         
"Other"                            
"Unknown"                          
"white or Caucasian"                
"White or Caucasian"                
"White or Caucasion" 

How do I "find and replace" to create one unified spelling and convert it back to a factor with the appropriate number of levels?

Upvotes: 1

Views: 708

Answers (1)

akraf

Reputation: 3255

It's difficult to find a one-size-fits-all solution, because strings that look similar may describe very different things (e.g. Granada vs. Grenada). The comments under the original post are worth looking into.

See "Approximate string matching" on Wikipedia (also sometimes called "fuzzy matching"). As you can see there, there are many ways to define "similar" for strings.

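The most basic tool is the R function adist. It calculates the so-called edit distance: the minimal number of character insertions, deletions and substitutions needed to turn one string into the other.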
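As a quick illustration (the example words are my own choice, not taken from the question), adist returns a matrix of distances. Note that a distance of 1 can be a typo just as well as a genuinely different value:

```r
# One substitution (a -> e) separates two different place names
d_places <- adist("Granada", "Grenada")
d_places[1, 1]
## [1] 1

# Two substitutions and one insertion
d_words <- adist("kitten", "sitting")
d_words[1, 1]
## [1] 3

# With ignore.case = TRUE, differences in capitalization cost nothing
d_case <- adist("white or Caucasian", "White or Caucasian", ignore.case = TRUE)
d_case[1, 1]
## [1] 0
```

The same function applied to the values from the question: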

x <- c("American Indian and Alaska Native" ,
   "Asian"                             ,
   "Black of African American"        ,
   "Black or African American"         ,
   "Other"                            ,
   "Unknown"                          ,
   "white or Caucasian"                ,
   "White or Caucasian"                ,
   "White or Caucasion" )
u <- unique(x)
# compare all strings against each other
d <- adist(u)
# Do not list combinations of similar words twice
d[lower.tri(d)] <- NA
# Say your threshold below which you want to consider strings similar is 
# 2 edits:
a <- which(d > 0 & d < 2, arr.ind = TRUE)
a
##      row col
## [1,]   3   4
## [2,]   7   8
## [3,]   8   9
pairs <- cbind(u[a[,1]], u[a[,2]])
pairs
##      [,1]                        [,2]                       
## [1,] "Black of African American" "Black or African American"
## [2,] "white or Caucasian"        "White or Caucasian"       
## [3,] "White or Caucasian"        "White or Caucasion" 

But in the end you will have to curate the results yourself to avoid accidentally merging levels that are genuinely different.

You can do this reproducibly by using a named vector as a translation dictionary. For example, from looking at the output above I could create the following dictionary:

dict <- c(
   # incorrect spellings          correct spellings
   # -------------------------    ----------------------------
   "Black of African American" =  "Black or African American",
   "white or Caucasian"        =  "White or Caucasian"       ,
   "White or Caucasion"        =  "White or Caucasian"
)
# The correct levels need to be included too, so that they map to themselves.
# The corrections come first: for duplicated names, indexing by name uses
# the first match.
dict <- c(dict, setNames(u, u))

Then convert your factor column to character with as.character and apply the dictionary to it, as I do here with the original character vector x:

xcorrected <- dict[x]
# Show the result without names. xcorrected on its own is also correct; it
# just carries the original spellings as names (remove as.character here to
# see the difference).
as.character(xcorrected)
## [1] "American Indian and Alaska Native" "Asian"                            
## [3] "Black or African American"         "Black or African American"        
## [5] "Other"                             "Unknown"                          
## [7] "White or Caucasian"                "White or Caucasian"               
## [9] "White or Caucasian"
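To cover the last part of the question, the result can be turned back into a factor. This is a minimal sketch, assuming a hypothetical data frame df with a factor column race (rebuilt here from the example values so that the block runs on its own) and a dictionary that maps every misspelling, including the lowercase "white or Caucasian", to the intended spelling:

```r
# Hypothetical data: the misspelled values stored as a factor column
x <- c("American Indian and Alaska Native", "Asian",
       "Black of African American", "Black or African American",
       "Other", "Unknown",
       "white or Caucasian", "White or Caucasian", "White or Caucasion")
df <- data.frame(race = factor(x))

# Corrections first, then every existing value mapped to itself
dict <- c("Black of African American" = "Black or African American",
          "white or Caucasian"        = "White or Caucasian",
          "White or Caucasion"        = "White or Caucasian")
dict <- c(dict, setNames(unique(x), unique(x)))

# Index by the character values; indexing with the factor itself would
# use its integer codes instead of its labels
df$race <- factor(unname(dict[as.character(df$race)]))

nlevels(df$race)
## [1] 6
levels(df$race)
```

A value that is missing from the dictionary comes back as NA, so checking anyNA(df$race) afterwards is a cheap safeguard.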

Upvotes: 1
