Reputation: 11
I've found variants on this issue but can't get the suggested solutions working in my situation. I'm pretty new to R with no other coding experience so it may be I'm just missing something basic. Thanks for any help!!
I have a data table with a column of names of organisations, call it Orgs$OrgName
. Sometimes there are misspellings of words within the strings that make up the organisation names. I have a look-up table (imported from csv with common misspellings in one column (spelling$misspelt
) and their corrections in another column (spelling$correct
).
I want to find any parts of OrgName
strings which match spelling$misspelt
and replace just those parts with spelling$correct
.
I have tried various solutions based on mgsub, stri_replace_all_fixed, str_replace_all (replacement of words in strings has been my main reference). But nothing has worked and all the examples appear to be based on manually created vectors using vect1 <- c("item1", "item2", item3") rather than based on a lookup table.
Example of my data:
OrgName
-----------------------------------------------
WAIROA DISTRICT COUNCIL
MANUTAI MARAE COMMITTEE
C S AUTOTECH LTD
NEW ZEALAND INSTITUTE OF SPORT
BRAUHAUS FRINGS
CHRISTCHURCH YOUNG MENS CHRISTIAN ASSOCIATION
The lookup table:
Misspelt Correct
--------------------------------
ABANDONNED ABANDONED
ABERATION ABERRATION
ABILITYES ABILITIES
ABILTIES ABILITIES
ABILTY ABILITY
ABONDON ABANDON
(There's no misspellings in the first few lines of org names but there's 57000+ more in the dataset)
UPDATE: here's what I have tried based on the update to the second response (trying that first as it's simpler). It hasn't worked, but hopefully someone can see where it's gone wrong?
library("stringi")
Orgs <- data.frame(OrgNameClean$OrgNameClean)
head(Orgs)
head(OrgNameClean)
write.csv(spelling$misspelt,file = "wrong.csv")
write.csv(spelling$correctspelling,file = "corrected.csv")
patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
head(patterns)
head(replacements)
for(i in 1:nrow(Orgs)) {
row <- Orgs[i,]
print(as.character(row))
#print(stri_replace_all_fixed(row, patterns, replacements,
vectorize_all=FALSE))
row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+%
"\\b", replacements, vectorize_all=FALSE)
print(row)
Orgs[i,] <- row
}
head(Orgs)
Orgsdt <- data.table(Orgs)
head(Orgsdt)
chckspellchk <- Orgsdt[OrgNameClean.OrgNameClean %like% "ENVIORNMENT",,]
##should return no rows if spelling correction worked
head(chckspellchk)
#OrgNameClean.OrgNameClean
#1: SMART ENVIORNMENTAL LTD
UPDATE 2: more information - there are spaces in the spelling lookup if that makes a difference:
head(spelling[misspelt %like% " ",,])
Misspelt correctspelling
------------------------------------
COCA COLA COCA
TORTISE TORTOISE
> head(spelling[correctspelling %like% " "])
Misspelt correctspelling
------------------------------------
ABOUTA ABOUT A
ABOUTIT ABOUT IT
ABOUTTHE ABOUT THE
ALOT A LOT
ANYOTHER ANY OTHER
ASFAR AS FAR
Upvotes: 1
Views: 3347
Reputation: 11
I came across a similar issue and might have a tidyverse
-style solution.
stringr::str_replace_all
should let us do multiple replacements using a named vector.
With the lookup data frame of misspelled and corrected values we could turn that into a named vector. Then we could use that named vector as a lookup in str_replace_all
.
Here is an example using some of the misspelled and corrected values provided previously.
library(tidyverse)
# load data frame of misspelled and corrected values
foo <- read_csv("mispelt, correct
ABANDONNED, ABANDONED
ABERATION, ABERRATION
ABILITYES, ABILITIES
ABILTIES, ABILITIES
ABILTY, ABILITY
ABONDON, ABANDON
COCA COLA, COCA
TORTISE, TORTOISE
ABOUTA, ABOUT A
ABOUTIT, ABOUT IT
ABOUTTHE, ABOUT THE
ALOT, A LOT
ANYOTHER, ANY OTHER
ASFAR, AS FAR",
col_types = "c")
# str_replace_all requires a named vector of replacements
# the value of the vector is the correction,
# while the name of each value is the search string to replace
lookup <- foo$correct
names(lookup) <- foo$mispelt
# data frame to test our lookup named vector
tbl <- tibble(old = foo$mispelt)
# mutating to a new column to show replacement works,
# but we could just overwrite the old column as well using mutate
mutate(tbl, new = str_replace_all(old, lookup))
I did not deal with upper or lower case considerations as I'm just demonstrating the named vector usage in str_replace_all
and the examples were all upper case. However, regular expressions and/or the regex
function could probably help with that if necessary.
Session info:
|package |loadedversion |
|:---------|:-------------|
|dplyr |1.0.7 |
|readr |2.0.0 |
|stringr |1.4.0 |
|tibble |3.1.3 |
|tidyverse |1.3.1 |
Upvotes: 1
Reputation: 2336
This answer is potentially too complicated for a new programmer, and I may be writing this more like Python than R (I'm getting a bit rusty on the latter)* but I have a proposed solution for your problem, which isn't trivial by the way. The issues I foresee you having with other solutions you looked at is that they individually only address one small part of the larger puzzle, which is that you need to be able to check every word inside every string against your lookup table. The simplest way I see to do this is to write a number of small functions to do what you need and then use R's family of apply functions to loop through entries and use the functions.
The only other tricky thing here is using an R environment
as your lookup table. For whatever reason in R people don't seem to talk much about or really use hash tables (the real name for a lookup table) but they are very common in other languages. Luckily R's environments
are actually just an implementation of a C hash table, which is good because hashes are very fast and allow you to directly map one value to another. (More on this here, if interested.)
*I welcome comments or edits from others that would make my answer simpler or more R-idiomatic
# Some example data - note stringsAsFactors=FALSE is critical for this to work
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL',
'C S AUTOTECH LTD',
'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
stringsAsFactors = FALSE)
spelling_df <- data.frame("Mistake" = c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW'),
"Correct"= c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD'),
stringsAsFactors = FALSE)
# Function to convert your data frame to a hash table
create_hash <- function(in_df){
hash_table <- new.env(hash=TRUE)
for(i in seq(nrow(in_df)))
{
hash_table[[in_df[i, 1]]] <- in_df[i, 2]
}
return(hash_table)
}
# Make the hash table out of your data frame
spelling_hash <- create_hash(spelling_df)
# Try it out:
print(spelling_hash[['ABANDONNED']]) # prints ABANDONED
# Now make a function to apply the lookup - and ensure
# if the string is not in the lookup table, you return the
# original string instead (instead of NULL)
apply_hash <- function(in_string, hash_table=spelling_hash){
x = hash_table[[in_string]]
if(!is.null(x)){
return(x)
}
else{
return(in_string)
}
}
# Finally make a function to break the full company name apart,
# apply the lookup on each word, and then paste it back together
correct_spelling <- function(bad_string) {
split_string <- strsplit(as.character(bad_string), " ")
new_split <- lapply(split_string[[1]], apply_hash)
return(paste(new_split, collapse=' '))
}
# Make a new field that applies the spelling correction
Orgs$Corrected <- sapply(Orgs$OrgName, correct_spelling)
Upvotes: 0
Reputation: 19000
We can use stringi's stri_replace_*_all()
to do multiple replacements on a whole string.
library("stringi")
string <- "WAIROA ABANDONNED COUNCIL','C S AUTOTECH LTD', 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES"
mistake <- c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW')
corrected <- c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD')
stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
Output:
[1] "WAIROA ABANDONED COUNCIL','C S AUTOTECH SGM', 'OLD ZEALAND INSTITUTE OF ABERRATION ABILITIES"
Some notes:
stri_replace_all_fixed
replaces occurrences of a fixed pattern matches.
stri_replace_all_regex
uses a regular expression pattern instead. This allows us to specify word boundaries: \b
to avoid substring matches (an alternative to \bword\b
is (?<=\W)word(?=\W)
).
vectorize_all
is set to FALSE, otherwise each replacement is applied to a new copy of the original sentence. See details here.
Full sample:
library("stringi")
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL',
' SMART ENVIORNMENTAL LTD',
'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
stringsAsFactors = FALSE)
patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
for(i in 1:nrow(Orgs)) {
row <- Orgs[i,]
print(as.character(row))
row <- stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE)
#row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
print(row)
Orgs[i,] <- row
}
PS: I've made a separate CSV with a single headerless column for each character vector. But there are many other ways to read a CSV with R and convert the columns to a char vector.
PS2: If you want substring matches, eg. match ENVIORNMENT
in ENVIORNMENTAL
do not use stri_replace_all_regex()
along with word boundaries \b
. This is a great tutorial to buff-up your regex skills.
Upvotes: 1