Reputation: 458
Suppose A is a data frame whose structure is as follows
Row no  C1                 C2
1       I am fine          1234
2       He is fine         1234
3       am better          1234
4       better butter      1234
5       fine good          1234
6       good to be better  1234
and B is another data frame such that
Row no  C1
1       fine
2       good
I want to compare A$C1 with B$C1, checking whether each string in B$C1 is contained in A$C1. The result of the comparison should be the row numbers of A that contain any of B's strings. For the above scenario the output would be 1, 2, 5, 6: rows 1, 2 and 5 contain the word "fine", and row 6 contains the word "good". I don't want to compare "good" against row 5 of A, since that row has already been selected. I need an efficient solution, because my real data set A has around 400,000 rows and B around 10,000.
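For reference, a minimal construction of the example data (assuming character columns rather than factors):

A <- data.frame(C1 = c("I am fine", "He is fine", "am better",
                       "better butter", "fine good", "good to be better"),
                C2 = 1234, stringsAsFactors = FALSE)
B <- data.frame(C1 = c("fine", "good"), stringsAsFactors = FALSE)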
Upvotes: 1
Views: 223
Reputation: 46886
This function
phrasesWithWords <- function(x, table)
{
    ## split each phrase into words at non-word characters
    words <- strsplit(x, "\\W")
    ## look up all words at once, then reshape the logical vector
    ## back to the list structure of 'words'
    found <- relist(unlist(words) %in% table, words)
    ## keep the phrases containing at least one table word
    which(sapply(found, any))
}
works on your phrases and a table of acceptable words:
phrase <- c("I am fine", "He is fine", "am better", "better butter",
"fine good", "good to be better")
table <- c("fine", "good")
phrasesWithWords(phrase, table)
The function works by splitting the phrases into words, looking each word up in the table (without looping through the long list of phrases), re-listing the logical vector, and asking which list elements contain at least one TRUE.
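Run by hand on a two-phrase subset, the intermediate steps look like this (results shown as comments):

words <- strsplit(c("I am fine", "am better"), "\\W")
## list(c("I", "am", "fine"), c("am", "better"))
unlist(words) %in% c("fine", "good")
## FALSE FALSE TRUE FALSE FALSE
relist(unlist(words) %in% c("fine", "good"), words)
## list(c(FALSE, FALSE, TRUE), c(FALSE, FALSE))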
This turns out not to be so efficient compared to a simple grep solution
f1 <- function(x, table)
    grep(paste(table, collapse="|"), x)
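Applied to the example phrases it selects the same rows:

f1(phrase, table)
## [1] 1 2 5 6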
Benchmarking with

library(microbenchmark)
x1000 <- rep(phrase, 1000)
giving
> microbenchmark(phrasesWithWords(x1000, table), f1(x1000, table),
+     times=5)
Unit: milliseconds
                           expr        min         lq     median         uq        max neval
 phrasesWithWords(x1000, table) 130.167172 132.815303 133.011161 133.112888 134.504282     5
               f1(x1000, table)   2.959576   2.973416   2.990412   3.060494   3.439293     5
The pretty neat package "lineprof" shows that for a modified function
f0 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    idx <- unlist(words) %in% table
    found <- relist(idx, words)
    which(sapply(found, any))
}
the main bottleneck is in relist
> lineprof(f0(x1000, table))
Reducing depth to 2 (from 7)
Common path: words.R!30719TCY
   time  alloc release  dups                ref         src
1 0.003  0.668       0     0 words.R!30719TCY#3 f0/strsplit
2 0.024 28.240       0 17393 words.R!30719TCY#5 f0/relist
3 0.003  3.959       0  6617 words.R!30719TCY#6 f0/which
leading to a more elaborate approach
f2 <- function(x, table)
{
    words <- strsplit(x, "\\W")
    ## position of each phrase's last word in the unlisted vector
    len <- cumsum(sapply(words, length))
    ## running count of table words seen so far
    idx <- cumsum(unlist(words) %in% table)
    ## a phrase matches when the running count grows across it
    which(idx[len] != c(0, idx[head(len, -1)]))
}
which is somewhat better-performing
> identical(f2(x1000, table), f1(x1000, table))
[1] TRUE
> microbenchmark(f2(x1000, table), f1(x1000, table), times=5)
Unit: milliseconds
             expr       min        lq    median        uq       max neval
 f2(x1000, table) 25.426832 25.815504 25.844033 26.075279 26.387559     5
 f1(x1000, table)  2.963365  2.968197  2.984395  2.984423  3.129873     5
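To see why f2 selects the right phrases, here is its indexing logic traced on the six example phrases (values shown as comments):

words <- strsplit(phrase, "\\W")
len <- cumsum(sapply(words, length))
## 3 6 8 10 12 16 -- position of each phrase's last word
idx <- cumsum(unlist(words) %in% table)
## idx[len] is 1 2 2 2 4 5; the preceding counts are 0 1 2 2 2 4
## the count grows across phrases 1, 2, 5 and 6, so those are returned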
I think both f2 and f1 would scale well enough to the problem in the original question, provided there is sufficient memory. (If the table of acceptable words is small compared to the phrases, I think the grep approach will actually be more memory-efficient; in the end I might up-vote the simple grep solution!) Maybe the major limitation of the grep approach is that the size of the regular expression is limited, on my computer to about 2560 terms:
> grep(paste(as.character(1:2559), collapse="|"), "1")
[1] 1
> grep(paste(as.character(1:2560), collapse="|"), "1")
Error in grep(paste(as.character(1:2560), collapse = "|"), "1") :
invalid regular expression '1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|4
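One workaround for the pattern-size limit (my sketch, not part of the original answer; the chunk size is a guess) is to split the word table into groups that each fit in a single regular expression and combine the hits:

f1_chunked <- function(x, table, chunk = 2000)
{
    ## hypothetical variant of f1: grep each group of words separately
    groups <- split(table, ceiling(seq_along(table) / chunk))
    hits <- lapply(groups, function(g) grep(paste(g, collapse = "|"), x))
    sort(unique(unlist(hits)))
}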
Upvotes: 3
Reputation: 11617
grep can do the job for you:
grep(paste(B$C1, collapse="|"), A$C1)
[1] 1 2 5 6
The code above gets you all the rows of A$C1 that contain at least one word of B$C1, that is, rows 1, 2, 5 and 6. The first argument is a regular expression, which is why we collapse the words with "|" (meaning "or").
And it seems scalable. Benchmarking with 100,000 sample phrases (built from your phrases) and the two words, grep takes only 0.076 seconds.
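A timing of that scale can be reproduced along the following lines (my sketch; exact phrases and hardware will change the numbers):

x100k <- rep(c("I am fine", "He is fine", "am better", "better butter",
               "fine good", "good to be better"), length.out = 100000)
system.time(grep("fine|good", x100k))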
Upvotes: 3