Max TC

Reputation: 79

Replacing words in strings

I have a list of phrases in which I want to replace certain misspelled words with the correctly spelled word.

How can I search a string for a matching word and replace it?

The expected result is the following example:

a1 <- c(" the classroom is ful ")
a2 <- c(" full")

In this case I would be replacing ful with full in a1.

Upvotes: 2

Views: 1991

Answers (4)

Eric Watt

Reputation: 3230

Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.

library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
#  [1] "fool" "flu"  "fl"   "fuel" "furl" "foul" "full" "fun"  "fur"  "fut"  "fol"  "fug"  "fum" 

So even in your example, would you want to replace ful with full, or with one of the many other options here?

The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.

library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "

But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.

a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3, 
                paste("\\b", 
                      hunspell(a3)[[1]], 
                      "\\b", 
                      collapse = "", sep = ""), 
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "

Update

Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.

library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"

Update 2

Addressing your comment: with your new example, the issue is once again words showing up inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" on its own matches "thin", but also the start of "think", "thinking", etc. Bracketing it with \\b anchors the pattern to word boundaries, so \\bthin\\b only matches the whole word "thin".

Your example:

a <- c(" thin, thic, thi") 
badwords.corpus <- c("thin", "thic", "thi" ) 
goodwords.corpus <- c("think", "thick", "this")

The solution is to modify badwords.corpus:

badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"

Then create vect.corpus as described in the previous update, and use it in str_replace_all.

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a, vect.corpus)
# [1] " think, thick, this" 
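
If you would rather not depend on stringr, the same idea can be sketched in base R. This is only a minimal illustration: replace_words is a hypothetical helper name (not from any package), and it assumes the bad words contain no regex metacharacters.

```r
# Hypothetical helper: whole-word replacement using only base R.
# Assumes `bad` and `good` are character vectors of equal length and
# that the bad words contain no regex metacharacters.
replace_words <- function(x, bad, good) {
  stopifnot(length(bad) == length(good))
  for (i in seq_along(bad)) {
    # \\b on both sides means "thi" cannot match inside "thin" or "think"
    x <- gsub(paste0("\\b", bad[i], "\\b"), good[i], x)
  }
  x
}

replace_words(" thin, thic, thi",
              bad  = c("thin", "thic", "thi"),
              good = c("think", "thick", "this"))
# [1] " think, thick, this"
```

Note that the replacements run one after another, so a corrected word that also appears in the bad-word vector would be replaced again on a later pass.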

Upvotes: 4

nghauran

Reputation: 6768

For a kind of ordered replacement, you can try this:

a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")

qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)

For unordered replacement, you can use approximate string matching (see stringdist::amatch). Here is an example:

a1 <- c("the classroome is ful")
a1
# [1] "the classroome is ful"

library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
# [1] "the classroom is full"
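
If installing stringdist is not an option, a similar idea can be sketched with utils::adist from base R. Note that this swaps in a different distance: adist computes a generalized Levenshtein distance, whereas amatch defaults to optimal string alignment, so results may differ on transpositions. The variable names below are only for illustration.

```r
# Sketch: approximate matching against a list of good words using base R.
words <- strsplit("the classroome is ful", " ", fixed = TRUE)[[1]]
goodwords <- c("full", "classroom")

# Distance matrix: one row per input word, one column per good word
d <- utils::adist(words, goodwords)

# Index of the closest good word for each input word
best <- apply(d, 1, which.min)

# Only accept a correction when the closest good word is within distance 1
close_enough <- d[cbind(seq_along(words), best)] <= 1
words[close_enough] <- goodwords[best[close_enough]]

result <- paste(words, collapse = " ")
result
# [1] "the classroom is full"
```

Raising the distance threshold corrects more words but also increases the chance of changing a word that was already right.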

Upvotes: 0

G. Grothendieck

Reputation: 269586

Create a list of the corrections, then apply them using gsubfn, a generalization of gsub whose replacement can also be a list, a function, or a proto object. The regular expression matches a word boundary, one or more word characters, and another word boundary. Each time it finds a match, it looks up the matched word in the list names and, if found, replaces it with the corresponding list value.

library(gsubfn)

L <- list(ful = "full")  # can add more words to this list if desired

gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
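
For comparison, here is a rough base R sketch of the same lookup idea using gregexpr and regmatches<-. It is not what gsubfn does internally, just an illustration; fixes and x are names made up for the example.

```r
# Sketch: look up every word in a named vector of corrections (base R only).
x <- " the classroom is ful "
fixes <- c(ful = "full")  # names are misspellings, values are corrections

# Locate every whole word in the string
m <- gregexpr("\\b\\w+\\b", x, perl = TRUE)

# Replace each matched word that has an entry in `fixes`; leave the rest alone
regmatches(x, m) <- lapply(regmatches(x, m), function(w) {
  hit <- w %in% names(fixes)
  w[hit] <- unname(fixes[w[hit]])
  w
})
x
# [1] " the classroom is full "
```

Because every word is looked up once in a single pass, a corrected word is never run through the corrections a second time.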

Upvotes: 0

tobiaspk1

Reputation: 388

I think the function you are looking for is gsub():

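# \\b keeps "ful" from matching inside other words; trimws() drops the leading space in a2
gsub(pattern = "\\bful\\b", replacement = trimws(a2), x = a1)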

Upvotes: 0
