Reputation: 937

Vectorize for loop that finds occurrences in R

I have a variable within a dataset that contains phrases I'd like to do a string search on (female$Var2). I want to find the number of rows each phrase is present in another dataframe (female_df$MH2). So for instance, female$Var2 looks like:

myocardial infarction drug therapy
imipramine poisoning
oximetry
thrombosis drug therapy
angioedema chemically induced

And I want to find the number of rows that contain each of the above phrases in the dataframe female_df$MH2 which looks like this

oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy

So my resulting output should look like this:

myocardial infarction drug therapy          1
imipramine poisoning                        1
oximetry                                    2
thrombosis drug therapy                     2
angioedema chemically induced               1

Note that's not number of total occurrences (see angioedema...). It's the number of rows that contain that phrase. I am currently running a for loop which is taking way too long because it's searching 5,000+ terms on 428,000+ rows. When I try vectorizing my function using occurrences_female(female$Var2), I get the In grepl(word, female_df$MH2, ignore.case = TRUE) : argument 'pattern' has length > 1 and only the first element will be used error, returning only the variable for the first female$Var2

This is the for loop I am running

for (i in 1:nrow(female))
{
  word <- female$Var2[i]
  df_female <- data.frame(word, occurrences_female(word))
  df_female2 <- rbind(df_female2, df_female)
}

based on this function

occurrences_female <- function(word)
{
  # inserts \\b in the beginning
  word <- paste0("\\b", word)

  # inserts \\b at the end
  n <- nchar(word)
  word <- paste(substr(word, 1, n), "\\b", sep = "")

  occurrences <- sum(grepl(word, female_df$MH2, ignore.case = TRUE))

  return (occurrences)
}

The function works when I do it manually but I need to have it done on 5,000+ terms and a for loop is way too slow (it's been running for over 2 hours). I don't know how to do a search of one variable of a dataframe on a variable from a different dataframe.

Upvotes: 0

Answers (3)

www

Reputation: 39154

Summary

We can use the following code to achieve the task. Benchmarking shows that this is approahc has good performance.

library(purrr)
library(stringr)

female$Count <- map_int(female$Var2, 
                    function(x){sum(str_detect(female_df$MH2, pattern = x))})

Introduction

There are multiple ways to count how many rows contains each word or phrase. But based on the answers and discussions in this thread so far, the general strategy to achieve this.

Use a function to vectorize the operation, such as lapply and sapply from base R, or map function from the purrr package.
Use a function to count or detect if a particular pattern (word or phrase) is in a string. These functions are like grep, grepl from base R, or str_detect or str_which from the stringr package.

Since the OP has a huge amount of data to process, I conducted an analysis to compare which combinations of functions from base R, purrr, and stringr can achieve the same task taking the least amount of time.

I investigated a total of eight combinations. There are choices between using sapply or map_int, grep or str_which, and grepl or str_detect.

Data Preparation

Here I created two data frames, female and female_df, based on OP's example. Notice that I set stringsAsFactors to make sure each entire column is in character format.

# Create the example data frame: female
female <- data.frame(Var2 = c("myocardial infarction drug therapy", 
                              "imipramine poisoning",
                              "oximetry",
                              "thrombosis drug therapy",
                              "angioedema chemically induced"),
                     stringsAsFactors = FALSE)

# Create the example data frame: female_df
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
                                "angioedema chemically induced, angioedema chemically induced, oximetry",
                                "abo blood group system, imipramine poisoning, adverse effects",
                                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                                "thrombosis drug therapy"),
                        stringsAsFactors = FALSE)

I also load the required packages. microbenchmark is a package to evaluate code performance.

# Load packages
library(purrr)
library(stringr)
library(microbenchmark)

Combination of Functions

Here is a list of combination of functions that can achieve OP's task.

Combination 1

This is from Luís Telles's answer. It uses sapply and grepl.

sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})

myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

Combination 2

This is from Dave2e's answer. It uses sapply and grep.

sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})

myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

Combination 3

This uses map_int and str_detect.

map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1

Combination 4

This uses map_int and str_which.

map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1

Combination 5

This uses map_int and grepl.

map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
[1] 1 1 2 2 1

Combination 6

This uses map_int and grep.

map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})
[1] 1 1 2 2 1

Combination 7

This uses sapply and str_detect.

sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

Combination 8

This uses sapply and str_which.

sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
myocardial infarction drug therapy               imipramine poisoning 
                                 1                                  1 
                          oximetry            thrombosis drug therapy 
                                 2                                  2 
     angioedema chemically induced 
                                 1

All these combinations are valid answer. For example, we can do female$Count < to store any of the results from these combinations.

Microbenchmark

Here I conducted benchmarking of these eight combinations with 30000 times sampling.

m <- microbenchmark(
  C1 = {sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
  C2 = {sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})},
  C3 = {map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
  C4 = {map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
  C5 = {map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
  C6 = {map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})},
  C7 = {sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
  C8 = {sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
  times = 30000L
)

print(m)

Unit: microseconds
 expr     min      lq     mean   median       uq       max neval
   C1 166.144 200.784 1503.780 2192.261 2401.063 184228.81 30000
   C2 163.578 198.860 1420.937 1460.653 2280.465 144553.22 30000
   C3 189.238 231.575 1502.319  790.305 2386.309 146455.85 30000
   C4 200.784 246.329 1461.714 1224.909 2306.125 184189.04 30000
   C5 150.107 185.388 1452.586 1970.630 2376.687  32124.08 30000
   C6 148.824 184.105 1398.312 1921.556 2259.937 145843.88 30000
   C7 205.916 251.461 1516.979  851.246 2408.119 146305.10 30000
   C8 215.538 264.932 1481.538 1508.764 2324.727 229709.16 30000

All these combinations have similar average time, but Combination 3, the use of map_int and str_detect, has the lowest median.

Upvotes: 4

Dave2e

Reputation: 24069

In your solution above the constant use of rbind to add each row onto the data frame is very expensive in terms of processing time.

Here is a solution using the stringr package.

#Data set up
var2<-c("myocardial infarction drug therapy", "imipramine poisoning", "oximetry",
             "thrombosis drug therapy", "angioedema chemically induced")
female<-data.frame(var2, stringsAsFactors = FALSE)

MH2<-c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
"angioedema chemically induced, angioedema chemically induced, oximetry",
                "abo blood group system, imipramine poisoning, adverse effects",
                "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
                "thrombosis drug therapy")
female_df<-data.frame(MH2, stringsAsFactors = FALSE)

library(stringr)
#create a matrix where columns is the terms
# and the rows are the lines checked.
termmatrix<-sapply(female$var2, function(x){str_count(female_df$MH2, x)})
#find the sums of the columns to determine the number of times each term is used
ans<-colSums(termmatrix)

The final ans is a named vector with the terms and total counts.

Addition
To avoid creating a huge term matrix try:

ans<-sapply(female$var2, function(x){length(grep(x, female_df$MH2))})

A slight modification of Luis answer

Upvotes: 2

Luís Telles

Reputation: 694

A solution with only base R (assuming your female$VAR2 only has unique strings):

counts <- sapply(female$VAR2, function(x){ z <- sum(grepl(pattern = x,
                                                    x = female_df$MH2,
                                                    ignore.case = TRUE))
                                      z
                                     })
word_counts <- cbind(female$VAR2, counts)