SCHNZ
SCHNZ

Reputation: 11

R find and replace partial string based on lookup table

I've found variants on this issue but can't get the suggested solutions working in my situation. I'm pretty new to R with no other coding experience so it may be I'm just missing something basic. Thanks for any help!!

I have a data table with a column of names of organisations, call it Orgs$OrgName. Sometimes there are misspellings of words within the strings that make up the organisation names. I have a look-up table (imported from csv with common misspellings in one column (spelling$misspelt) and their corrections in another column (spelling$correct).

I want to find any parts of OrgName strings which match spelling$misspelt and replace just those parts with spelling$correct.

I have tried various solutions based on mgsub, stri_replace_all_fixed, str_replace_all (replacement of words in strings has been my main reference). But nothing has worked and all the examples appear to be based on manually created vectors using vect1 <- c("item1", "item2", item3") rather than based on a lookup table.

Example of my data:

OrgName
-----------------------------------------------
WAIROA DISTRICT COUNCIL
MANUTAI MARAE COMMITTEE
C S AUTOTECH LTD
NEW ZEALAND INSTITUTE OF SPORT
BRAUHAUS FRINGS
CHRISTCHURCH YOUNG MENS CHRISTIAN ASSOCIATION

The lookup table:

Misspelt         Correct 
--------------------------------
ABANDONNED       ABANDONED            
ABERATION        ABERRATION            
ABILITYES        ABILITIES            
ABILTIES         ABILITIES            
ABILTY           ABILITY            
ABONDON          ABANDON

(There's no misspellings in the first few lines of org names but there's 57000+ more in the dataset)

UPDATE: here's what I have tried based on the update to the second response (trying that first as it's simpler). It hasn't worked, but hopefully someone can see where it's gone wrong?

library("stringi")
Orgs <- data.frame(OrgNameClean$OrgNameClean)
head(Orgs)
head(OrgNameClean)

write.csv(spelling$misspelt,file = "wrong.csv")
write.csv(spelling$correctspelling,file = "corrected.csv")

patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
head(patterns)
head(replacements)

for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  #print(stri_replace_all_fixed(row, patterns, replacements, 
vectorize_all=FALSE))
  row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% 
"\\b", replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}

head(Orgs)
Orgsdt <- data.table(Orgs)
head(Orgsdt)
chckspellchk <- Orgsdt[OrgNameClean.OrgNameClean %like% "ENVIORNMENT",,] 
##should return no rows if spelling correction worked
head(chckspellchk)

#OrgNameClean.OrgNameClean
#1:   SMART ENVIORNMENTAL LTD

UPDATE 2: more information - there are spaces in the spelling lookup if that makes a difference:

head(spelling[misspelt %like% " ",,])

Misspelt            correctspelling 
------------------------------------
COCA COLA           COCA            
TORTISE             TORTOISE      
      
> head(spelling[correctspelling %like% " "])

Misspelt       correctspelling 
------------------------------------
ABOUTA         ABOUT A             
ABOUTIT        ABOUT IT             
ABOUTTHE       ABOUT THE             
ALOT           A LOT       
ANYOTHER       ANY OTHER             
ASFAR          AS FAR 

Upvotes: 1

Views: 3347

Answers (3)

greufek
greufek

Reputation: 11

I came across a similar issue and might have a tidyverse-style solution.

stringr::str_replace_all should let us do multiple replacements using a named vector.

With the lookup data frame of misspelled and corrected values we could turn that into a named vector. Then we could use that named vector as a lookup in str_replace_all.

Here is an example using some of the misspelled and corrected values provided previously.

library(tidyverse)

# load data frame of misspelled and corrected values
foo <- read_csv("mispelt, correct
                ABANDONNED, ABANDONED
                ABERATION, ABERRATION
                ABILITYES, ABILITIES
                ABILTIES, ABILITIES
                ABILTY, ABILITY
                ABONDON, ABANDON
                COCA COLA, COCA            
                TORTISE, TORTOISE
                ABOUTA, ABOUT A             
                ABOUTIT, ABOUT IT             
                ABOUTTHE, ABOUT THE             
                ALOT, A LOT       
                ANYOTHER, ANY OTHER             
                ASFAR, AS FAR", 
                col_types = "c")


# str_replace_all requires a named vector of replacements
# the value of the vector is the correction,
#   while the name of each value is the search string to replace
lookup <- foo$correct
names(lookup) <- foo$mispelt

# data frame to test our lookup named vector
tbl <- tibble(old = foo$mispelt)

# mutating to a new column to show replacement works, 
#   but we could just overwrite the old column as well using mutate
mutate(tbl, new = str_replace_all(old, lookup))

I did not deal with upper or lower case considerations as I'm just demonstrating the named vector usage in str_replace_all and the examples were all upper case. However, regular expressions and/or the regex function could probably help with that if necessary.

Session info:

|package   |loadedversion |
|:---------|:-------------|
|dplyr     |1.0.7         |
|readr     |2.0.0         |
|stringr   |1.4.0         |
|tibble    |3.1.3         |
|tidyverse |1.3.1         |

Upvotes: 1

HFBrowning
HFBrowning

Reputation: 2336

This answer is potentially too complicated for a new programmer, and I may be writing this more like Python than R (I'm getting a bit rusty on the latter)* but I have a proposed solution for your problem, which isn't trivial by the way. The issues I foresee you having with other solutions you looked at is that they individually only address one small part of the larger puzzle, which is that you need to be able to check every word inside every string against your lookup table. The simplest way I see to do this is to write a number of small functions to do what you need and then use R's family of apply functions to loop through entries and use the functions.

The only other tricky thing here is using an R environment as your lookup table. For whatever reason in R people don't seem to talk much about or really use hash tables (the real name for a lookup table) but they are very common in other languages. Luckily R's environments are actually just an implementation of a C hash table, which is good because hashes are very fast and allow you to directly map one value to another. (More on this here, if interested.)

*I welcome comments or edits from others that would make my answer simpler or more R-idiomatic

# Some example data - note stringsAsFactors=FALSE is critical for this to work
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL', 
                                 'C S AUTOTECH LTD', 
                                 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
                   stringsAsFactors = FALSE)

spelling_df <- data.frame("Mistake" = c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW'),
                          "Correct"= c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD'),
                       stringsAsFactors = FALSE)


# Function to convert your data frame to a hash table
create_hash <- function(in_df){
  hash_table <- new.env(hash=TRUE)
  for(i in seq(nrow(in_df)))
  {
    hash_table[[in_df[i, 1]]] <- in_df[i, 2]
  }
  return(hash_table)
}

# Make the hash table out of your data frame
spelling_hash <- create_hash(spelling_df)

# Try it out:
print(spelling_hash[['ABANDONNED']])  # prints ABANDONED

# Now make a function to apply the lookup - and ensure
# if the string is not in the lookup table, you return the 
# original string instead (instead of NULL)
apply_hash <- function(in_string, hash_table=spelling_hash){
  x = hash_table[[in_string]]
  if(!is.null(x)){
    return(x)
  }
  else{
    return(in_string)
  }
}

# Finally make a function to break the full company name apart, 
# apply the lookup on each word, and then paste it back together
correct_spelling <- function(bad_string) {
  split_string <- strsplit(as.character(bad_string), " ")
  new_split <- lapply(split_string[[1]], apply_hash)
  return(paste(new_split, collapse=' '))
}

# Make a new field that applies the spelling correction
Orgs$Corrected <- sapply(Orgs$OrgName, correct_spelling)

Upvotes: 0

wp78de
wp78de

Reputation: 19000

We can use stringi's stri_replace_*_all() to do multiple replacements on a whole string.

library("stringi")
string <- "WAIROA ABANDONNED COUNCIL','C S AUTOTECH LTD', 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES"
mistake <- c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW')
corrected <- c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD')

stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)    
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)

Output:

[1] "WAIROA ABANDONED COUNCIL','C S AUTOTECH SGM', 'OLD ZEALAND INSTITUTE OF ABERRATION ABILITIES"

Some notes:

  • stri_replace_all_fixed replaces occurrences of a fixed pattern matches.

  • stri_replace_all_regex uses a regular expression pattern instead. This allows us to specify word boundaries: \b to avoid substring matches (an alternative to \bword\b is (?<=\W)word(?=\W)).

  • vectorize_all is set to FALSE, otherwise each replacement is applied to a new copy of the original sentence. See details here.

Full sample:

library("stringi")
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL', 
                                 ' SMART ENVIORNMENTAL LTD',
                                 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
                   stringsAsFactors = FALSE)

patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")

for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  row <- stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE)
  #row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}

PS: I've made a separate CSV with a single headerless column for each character vector. But there are many other ways to read a CSV with R and convert the columns to a char vector.

PS2: If you want substring matches, eg. match ENVIORNMENT in ENVIORNMENTAL do not use stri_replace_all_regex() along with word boundaries \b. This is a great tutorial to buff-up your regex skills.

Upvotes: 1

Related Questions