tanay

Reputation: 458

How to compare two different columns (both containing strings) efficiently in R?

Suppose A is a data frame whose structure is as follows:

Row no    C1                    C2
1         I am fine             1234
2         He is fine            1234
3         am better             1234
4         better butter         1234
5         fine good             1234
6         good to be better     1234

and B is another data frame such that

Row no    C1
1         fine
2         good

I want to compare A$C1 with B$C1, checking whether a string from B$C1 is contained in A$C1. The result of the comparison should be the row numbers of A that contain any of B's strings. For the above scenario the output would be 1, 2, 5, 6: rows 1, 2 and 5 contain the word "fine", and row 6 contains the word "good". I don't want to compare "good" with row 5 of A, since row 5 has already been selected. I need an efficient solution, as my real data set A has around 400,000 rows and B has around 10,000.
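
For reference, a minimal sketch of the example data, assuming C1 is a plain character column and C2 is numeric:

A <- data.frame(
    C1 = c("I am fine", "He is fine", "am better", "better butter",
           "fine good", "good to be better"),
    C2 = 1234,
    stringsAsFactors = FALSE
)
B <- data.frame(C1 = c("fine", "good"), stringsAsFactors = FALSE)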

Upvotes: 1

Views: 223

Answers (2)

Martin Morgan

Reputation: 46886

This function

phrasesWithWords <- function(x, table)
{
    words <- strsplit(x, "\\W")
    found <- relist(unlist(words) %in% table, words)
    which(sapply(found, any))
}

works on your phrases and a table of acceptable words:

phrase <- c("I am fine", "He is fine", "am better", "better butter",
            "fine good", "good to be better")
table <- c("fine", "good")
phrasesWithWords(phrase, table)

The function works by splitting the phrases into words, looking each word up in the table with a single vectorized %in% (so there is no loop over the long list of phrases), re-listing the logical result to match the original phrases, and asking which list elements contain at least one TRUE.
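
A quick trace of the intermediate objects for the example phrases (the commented values follow directly from the code above):

words <- strsplit(phrase, "\\W")
# words[[1]] is c("I", "am", "fine"); words[[6]] is c("good", "to", "be", "better")
found <- relist(unlist(words) %in% table, words)
# the flat lookup unlist(words) %in% table is regrouped per phrase, so
sapply(found, any)            # TRUE TRUE FALSE FALSE TRUE TRUE
which(sapply(found, any))     # 1 2 5 6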

This turns out not to be so efficient compared to a simple grep solution

f1 <- function(x, table)
    grep(paste(table, collapse="|"), x)

with

library(microbenchmark)
x1000 <- rep(phrase, 1000)

giving

> microbenchmark(phrasesWithWords(x1000, table), f1(x1000, table),
+                times=5)
Unit: milliseconds
                           expr        min         lq     median         uq        max neval
 phrasesWithWords(x1000, table) 130.167172 132.815303 133.011161 133.112888 134.504282     5
               f1(x1000, table)   2.959576   2.973416   2.990412   3.060494   3.439293     5

The pretty neat "lineprof" package shows that, for a modified function

f0 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    idx <- unlist(words) %in% table
    found <- relist(idx, words)
    which(sapply(found, any))
}

the main bottleneck is in relist

> lineprof(f0(x1000, table))
Reducing depth to 2 (from 7)
Common path: words.R!30719TCY
   time  alloc release  dups                ref         src
1 0.003  0.668       0     0 words.R!30719TCY#3 f0/strsplit
2 0.024 28.240       0 17393 words.R!30719TCY#5 f0/relist  
3 0.003  3.959       0  6617 words.R!30719TCY#6 f0/which   

leading to a more elaborate approach

f2 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    ## index of each phrase's last word in the unlisted word vector
    len <- cumsum(sapply(words, length))
    ## running count of table hits over all words
    idx <- cumsum(unlist(words) %in% table)
    ## a phrase matches if the running count increased across its words
    which(idx[len] != c(0, idx[head(len, -1)]))
}

which is somewhat better-performing

> identical(f2(x1000, table), f1(x1000, table))
[1] TRUE
> microbenchmark(f2(x1000, table), f1(x1000, table), times=5)
Unit: milliseconds
             expr       min        lq    median        uq       max neval
 f2(x1000, table) 25.426832 25.815504 25.844033 26.075279 26.387559     5
 f1(x1000, table)  2.963365  2.968197  2.984395  2.984423  3.129873     5
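
To see why the index arithmetic in f2 works, here is a trace on the six example phrases (the commented values follow from the example data):

words <- strsplit(phrase, "\\W")
len <- cumsum(sapply(words, length))        # 3  6  8 10 12 16
idx <- cumsum(unlist(words) %in% table)     # running hit count; ends at 5
idx[len]                                    # 1 2 2 2 4 5
c(0, idx[head(len, -1)])                    # 0 1 2 2 2 4
# the two vectors differ at positions 1, 2, 5 and 6: exactly the matching rows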

I think both f2 and f1 would scale well enough to the problem in the original question, provided there is sufficient memory (if the table of acceptable words is small compared to the phrases, I think the grep approach will actually be more memory efficient; in the end I might up-vote the simple grep solution!). Maybe the major limitation of the grep approach is that the size of the regular expression is limited, on my computer to about 2560 terms; a chunked workaround is sketched after the example.

> grep(paste(as.character(1:2559), collapse="|"), "1")
[1] 1
> grep(paste(as.character(1:2560), collapse="|"), "1")
Error in grep(paste(as.character(1:2560), collapse = "|"), "1") : 
  invalid regular expression '1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|4
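
One way around the pattern-size limit, sketched here as an assumption rather than something from the original answer, is to split the word table into chunks, grep each sub-pattern separately, and take the union of the matching row numbers (grepChunked and chunkSize are hypothetical names):

grepChunked <- function(x, table, chunkSize = 1000)
{
    ## split the table into groups of at most chunkSize words
    chunks <- split(table, ceiling(seq_along(table) / chunkSize))
    ## grep each sub-pattern and merge the matching indices
    hits <- lapply(chunks, function(words)
        grep(paste(words, collapse = "|"), x))
    sort(unique(unlist(hits)))
}

For the example data, grepChunked(phrase, table) should return the same 1, 2, 5, 6 while keeping every individual regular expression well below the limit.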

Upvotes: 3

Carlos Cinelli

Reputation: 11617

grep can do the job for you:

grep(paste(B$C1, collapse="|"), A$C1)
1 2 5 6

The code above gets you all the lines in A$C1 that contain at least one word from B$C1, that is, lines 1, 2, 5 and 6. The first argument is a regular expression, which is why we collapse the words with "|" (meaning "or").

And it seems scalable. Benchmarking with 100,000 sample phrases (built from your phrases) and the two words, grep takes only 0.076 seconds.
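
A rough sketch of how such a benchmark could be reproduced (the answer does not show its exact setup, so the sampling below is an assumption):

set.seed(1)
big <- sample(A$C1, 100000, replace = TRUE)   # 100,000 phrases resampled from A
system.time(grep(paste(B$C1, collapse = "|"), big))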

Upvotes: 3
