Reputation: 31

Replace strings in text based on dictionary

I am new to R and need suggestions. I have a dataframe with 1 text field in it. I need to fix the misspelled words in that text field. To help with that, I have a second file (dictionary) with 2 columns - the misspelled words and the correct words to replace them.

How would you recommend doing it? I wrote a simple "for loop" but the performance is an issue. The file has ~120K rows and the dictionary has ~5k rows and the program's been running for hours. The text can have a max of 2000 characters.

Here is the code:

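output <- source_file$MEMO_MANUAL_TXT
for (i in seq_len(nrow(fix_file))) {    # loop over the dictionary file
  # pattern to search for, padded with spaces so only whole words match
  target  <- paste0(" ", fix_file$change_to_target[i], " ")
  # replacement text, padded the same way
  replace <- paste0(" ", fix_file$target[i], " ")
  output  <- gsub(target, replace, output, fixed = TRUE)
}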

Upvotes: 3

Answers (2)

TheComeOnMan

Reputation: 12905

I would try agrep. I'm not sure how well it scales though.

E.g.:

> agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, value = TRUE)
[1] "1 lazy"

Also check out pmatch and charmatch although I feel they won't be as useful to you.
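
To make the difference concrete, here is a minimal sketch of how pmatch and charmatch behave on a tiny candidate vector (the vector is made up for illustration). Both do partial matching on the start of a string only, so unlike agrep they will not catch a typo in the middle of a word:

```r
# Illustrative candidates only; not taken from the question's data.
candidates <- c("mean", "median")

unique_hit    <- pmatch("med", candidates)    # 2: unique partial match
ambiguous_p   <- pmatch("me", candidates)     # NA: matches both candidates
ambiguous_c   <- charmatch("me", candidates)  # 0: charmatch flags ambiguity with 0
no_match      <- charmatch("x", candidates)   # NA: nothing starts with "x"
```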

Upvotes: 2

agstudy

Reputation: 121588

Here is an example illustrating @joran's comment, using a data.table left join. It is very fast (instantaneous here).

library(data.table)

n1 <- 120e3
n2 <- 1e3
set.seed(1)
## create vocab
tt <- outer(letters,letters,paste0)
vocab <- as.vector(outer(tt,tt,paste0))
## create the dictionary 
dict <- data.table(miss = sample(vocab, n2, replace = FALSE),
                   good = sample(letters, n2, replace = TRUE), key = 'miss')
## the text table
orig <- data.table(miss = sample(vocab, n1, replace = TRUE), key = 'miss')
orig[dict]

      miss good
   1: aakq    v
   2: adac    t
   3: adxj    r
   4: aeye    t
   5: afji    g
  ---          
1027: zvia    d
1028: zygp    p
1029: zyjm    x
1030: zzak    t
1031: zzvs    q
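
The join above works on one word per row, whereas the question has free text. One possible way to bridge that gap, sketched here in base R under the assumption that words are separated by single spaces, is to split the text into words, do a single vectorised lookup with match, and paste the words back together. The names dict, memo and fixed and the sample data are hypothetical:

```r
# Hypothetical dictionary: misspelled word -> corrected word.
dict <- data.frame(miss = c("teh", "recieve", "adress"),
                   good = c("the", "receive", "address"),
                   stringsAsFactors = FALSE)

# Hypothetical text column.
memo <- c("please recieve teh parcel",
          "wrong adress on teh label",
          "nothing to fix")

# Split each memo into words on single spaces.
words <- strsplit(memo, " ", fixed = TRUE)

# Flatten to one long vector and remember which memo each word came from.
flat <- unlist(words)
id   <- rep(seq_along(words), lengths(words))

# One vectorised lookup for all words; NA means "not in the dictionary".
hit   <- match(flat, dict$miss)
found <- !is.na(hit)
flat[found] <- dict$good[hit[found]]

# Reassemble the memos in their original order.
groups <- split(flat, factor(id, levels = seq_along(words)))
fixed  <- unname(vapply(groups, paste, character(1), collapse = " "))
```

Unlike the loop in the question, this also corrects words at the very start or end of a string, and it makes one pass over the text instead of one gsub call per dictionary row. Punctuation attached to a word would prevent a match, so real data may need a more careful tokenizer.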

Upvotes: 1
