tanay

Reputation: 458

How to compare two different columns (both containing strings) efficiently in R?

Suppose A is a data frame whose structure is as follows:

Row no    C1                    C2
1         I am fine             1234
2         He is fine            1234
3         am better             1234
4         better butter         1234
5         fine good             1234
6         good to be better     1234

and B is another data frame such that

Row no    C1
1         fine
2         good

I want to compare A$C1 with B$C1, checking whether a string from B$C1 is contained in A$C1. The result of the comparison should be the row numbers of A that contain any of B's strings. For the above scenario the output would be 1, 2, 5, 6: rows 1, 2 and 5 contain the word "fine", and row 6 contains the word "good". I don't want to compare "good" with row 5 of A, since row 5 has already been selected. I need an efficient solution, as my real data set A has around 400,000 rows and B has around 10,000.
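
For reference, a minimal sketch of the example data, assuming C1 is a plain character column and C2 is numeric:

A <- data.frame(
    C1 = c("I am fine", "He is fine", "am better", "better butter",
           "fine good", "good to be better"),
    C2 = 1234,
    stringsAsFactors = FALSE
)
B <- data.frame(C1 = c("fine", "good"), stringsAsFactors = FALSE)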

Upvotes: 1

Views: 223

Answers (2)

Martin Morgan

Reputation: 46886

This function

phrasesWithWords <- function(x, table)
{
    words <- strsplit(x, "\\W")
    found <- relist(unlist(words) %in% table, words)
    which(sapply(found, any))
}

works on your phrases and a table of acceptable words:

phrase <- c("I am fine", "He is fine", "am better", "better butter",
            "fine good", "good to be better")
table <- c("fine", "good")
phrasesWithWords(phrase, table)

The function works by splitting the phrases into words, looking each word up in the table with a single vectorized %in% (so there is no loop over the long list of phrases), re-listing the logical result to match the original phrases, and asking which list elements contain at least one TRUE.
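
A quick trace of the intermediate objects for the example phrases (the commented values follow directly from the code above):

words <- strsplit(phrase, "\\W")
# words[[1]] is c("I", "am", "fine"); words[[6]] is c("good", "to", "be", "better")
found <- relist(unlist(words) %in% table, words)
# the flat lookup unlist(words) %in% table is regrouped per phrase, so
sapply(found, any)            # TRUE TRUE FALSE FALSE TRUE TRUE
which(sapply(found, any))     # 1 2 5 6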

This turns out not to be so efficient compared to a simple grep solution

f1 <- function(x, table)
    grep(paste(table, collapse="|"), x)

with

library(microbenchmark)
x1000 <- rep(phrase, 1000)

giving

> microbenchmark(phrasesWithWords(x1000, table), f1(x1000, table),
+                times=5)
Unit: milliseconds
                           expr        min         lq     median         uq        max neval
 phrasesWithWords(x1000, table) 130.167172 132.815303 133.011161 133.112888 134.504282     5
               f1(x1000, table)   2.959576   2.973416   2.990412   3.060494   3.439293     5

The pretty neat "lineprof" package shows that, for a modified function

f0 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    idx <- unlist(words) %in% table
    found <- relist(idx, words)
    which(sapply(found, any))
}

the main bottleneck is in relist

> lineprof(f0(x1000, table))
Reducing depth to 2 (from 7)
Common path: words.R!30719TCY
   time  alloc release  dups                ref         src
1 0.003  0.668       0     0 words.R!30719TCY#3 f0/strsplit
2 0.024 28.240       0 17393 words.R!30719TCY#5 f0/relist  
3 0.003  3.959       0  6617 words.R!30719TCY#6 f0/which   

leading to a more elaborate approach

f2 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    ## index of each phrase's last word in the unlisted word vector
    len <- cumsum(sapply(words, length))
    ## running count of table hits over all words
    idx <- cumsum(unlist(words) %in% table)
    ## a phrase matches if the running count increased across its words
    which(idx[len] != c(0, idx[head(len, -1)]))
}

which is somewhat better-performing

> identical(f2(x1000, table), f1(x1000, table))
[1] TRUE
> microbenchmark(f2(x1000, table), f1(x1000, table), times=5)
Unit: milliseconds
             expr       min        lq    median        uq       max neval
 f2(x1000, table) 25.426832 25.815504 25.844033 26.075279 26.387559     5
 f1(x1000, table)  2.963365  2.968197  2.984395  2.984423  3.129873     5
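
To see why the index arithmetic in f2 works, here is a trace on the six example phrases (the commented values follow from the example data):

words <- strsplit(phrase, "\\W")
len <- cumsum(sapply(words, length))        # 3  6  8 10 12 16
idx <- cumsum(unlist(words) %in% table)     # running hit count; ends at 5
idx[len]                                    # 1 2 2 2 4 5
c(0, idx[head(len, -1)])                    # 0 1 2 2 2 4
# the two vectors differ at positions 1, 2, 5 and 6: exactly the matching rows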

I think both f2 and f1 would scale well enough to the problem in the original question, provided there is sufficient memory (if the table of acceptable words is small compared to the phrases, I think the grep approach will actually be more memory efficient; in the end I might up-vote the simple grep solution!). Maybe the major limitation of the grep approach is that the size of the regular expression is limited, on my computer to about 2560 terms; a chunked workaround is sketched after the example.

> grep(paste(as.character(1:2559), collapse="|"), "1")
[1] 1
> grep(paste(as.character(1:2560), collapse="|"), "1")
Error in grep(paste(as.character(1:2560), collapse = "|"), "1") : 
  invalid regular expression '1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|4
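
One way around the pattern-size limit, sketched here as an assumption rather than something from the original answer, is to split the word table into chunks, grep each sub-pattern separately, and take the union of the matching row numbers (grepChunked and chunkSize are hypothetical names):

grepChunked <- function(x, table, chunkSize = 1000)
{
    ## split the table into groups of at most chunkSize words
    chunks <- split(table, ceiling(seq_along(table) / chunkSize))
    ## grep each sub-pattern and merge the matching indices
    hits <- lapply(chunks, function(words)
        grep(paste(words, collapse = "|"), x))
    sort(unique(unlist(hits)))
}

For the example data, grepChunked(phrase, table) should return the same 1, 2, 5, 6 while keeping every individual regular expression well below the limit.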

Upvotes: 3

Carlos Cinelli

Reputation: 11617

grep can do the job for you:

grep(paste(B$C1, collapse="|"), A$C1)
1 2 5 6

The code above gets you all the lines in A$C1 that contain at least one word from B$C1, that is, lines 1, 2, 5 and 6. The first argument is a regular expression, which is why we collapse the words with "|" (meaning "or").

And it seems scalable. Benchmarking with 100,000 sample phrases (built from your phrases) and the two words, grep takes only 0.076 seconds.
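
A rough sketch of how such a benchmark could be reproduced (the answer does not show its exact setup, so the sampling below is an assumption):

set.seed(1)
big <- sample(A$C1, 100000, replace = TRUE)   # 100,000 phrases resampled from A
system.time(grep(paste(B$C1, collapse = "|"), big))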

Upvotes: 3
