user8831872

Reputation: 383

Faster way to find words/phrases from a big dataframe

I have a dataframe with 10137 rows of text (named phrases) and another dataframe with 62000 terms (named words). For each term in the second dataframe, I want to check whether it appears in each text of the first dataframe, recording 1 if it does and 0 if it does not.

This snippet of code makes this process:

# Create some fake data
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada", 
             "continuous improvement is an unrealistic goal", 
             "phrase with no match")

# Apply the 'grepl' function along the list of words, and convert the result to numeric
df <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases))}))
# Name the columns the words that were searched
names(df) <- words

However, with my actual data (as described above) this takes a long time, so I am trying to find a more efficient approach. One thought was to split the work into parts, for example (based on the size of my dataframes):

 df_500 <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases[1:500]))}))
 df_1000 <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases[501:1000]))}))
 df_1500 <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases[1001:1500]))}))

# ... and so on until row 10137 (the number of rows of the first dataframe), then merge the results into one dataframe.

How can I run these in parallel? As written, the commands execute one after the other, so the total time stays the same. Is this the right way to speed it up?
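For reference, a minimal sketch of the chunk-free parallel idea using base R's parallel package, parallelising over the words instead of over row chunks (mclapply assumes a Unix-like system; the fake words/phrases from above are reused, and fixed = TRUE is an assumption that the terms are literal strings, not regular expressions):

```r
library(parallel)

words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
             "continuous improvement is an unrealistic goal",
             "phrase with no match")

# Run grepl for each word on a separate core; fixed = TRUE skips
# regex interpretation, which is usually faster for literal terms.
cols <- mclapply(words,
                 function(word) as.numeric(grepl(word, phrases, fixed = TRUE)),
                 mc.cores = max(1, detectCores() - 1))
df <- data.frame(cols)
names(df) <- words
```

Each worker handles one word, so the per-word grepl calls run concurrently rather than one after the other.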

Upvotes: 1

Views: 85

Answers (1)

clemens

Reputation: 6813

You can use the tm package to create a document-term matrix, together with a tokeniser from RWeka.

library(tm)
library(RWeka)

First, create the bigram tokeniser:

bigram_tokeniser <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

Then create a corpus from phrases:

corpus <- VCorpus(VectorSource(phrases)) 

Because of the dictionary argument, only the terms in the vector words will be counted; you can change that by adjusting the control list:

dtm <- DocumentTermMatrix(corpus, 
                          control = list(tokenize = bigram_tokeniser,
                                         dictionary = words))

You can then convert the document term matrix to a matrix and get the desired output:

as.matrix(dtm)

    Terms
Docs continuous improvement revenue stock
   1                      0       1     1
   2                      0       1     0
   3                      1       0     0
   4                      0       0     0
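One caveat: DocumentTermMatrix stores term counts, so a term occurring twice in a phrase would show 2, not 1. The output above is binary only because each term occurs at most once. A sketch of how to force strict 0/1 indicators, using a small count matrix standing in for as.matrix(dtm):

```r
# Stand-in for as.matrix(dtm): raw term counts per document,
# with one count of 2 to show the difference from presence/absence
m <- matrix(c(0, 2, 1,
              0, 1, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("continuous improvement",
                                      "revenue", "stock")))

m[m > 0] <- 1            # binarise: any positive count becomes 1
df <- as.data.frame(m)   # 0/1 data frame as asked for in the question
```

Alternatively, tm's control list accepts weighting = weightBin, which builds a binary document-term matrix directly.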

Upvotes: 4
