Reputation: 937
I have a variable within a dataset that contains phrases I'd like to do a string search on (female$Var2
). I want to find the number of rows each phrase is present in another dataframe (female_df$MH2
). So for instance, female$Var2
looks like:
myocardial infarction drug therapy
imipramine poisoning
oximetry
thrombosis drug therapy
angioedema chemically induced
And I want to find the number of rows that contain each of the above phrases in the dataframe female_df$MH2
which looks like this
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy
So my resulting output should look like this:
myocardial infarction drug therapy 1
imipramine poisoning 1
oximetry 2
thrombosis drug therapy 2
angioedema chemically induced 1
Note that's not number of total occurrences (see angioedema...). It's the number of rows that contain that phrase. I am currently running a for loop which is taking way too long because it's searching 5,000+ terms on 428,000+ rows. When I try vectorizing my function using occurrences_female(female$Var2)
, I get the In grepl(word, female_df$MH2, ignore.case = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
error, returning only the variable for the first female$Var2
This is the for loop I am running
for (i in 1:nrow(female))
{
word <- female$Var2[i]
df_female <- data.frame(word, occurrences_female(word))
df_female2 <- rbind(df_female2, df_female)
}
based on this function
occurrences_female <- function(word)
{
# inserts \\b in the beginning
word <- paste0("\\b", word)
# inserts \\b at the end
n <- nchar(word)
word <- paste(substr(word, 1, n), "\\b", sep = "")
occurrences <- sum(grepl(word, female_df$MH2, ignore.case = TRUE))
return (occurrences)
}
The function works when I do it manually but I need to have it done on 5,000+ terms and a for loop is way too slow (it's been running for over 2 hours). I don't know how to do a search of one variable of a dataframe on a variable from a different dataframe.
Upvotes: 0
Views: 1151
Reputation: 39154
We can use the following code to achieve the task. Benchmarking shows that this is approahc has good performance.
library(purrr)
library(stringr)
female$Count <- map_int(female$Var2,
function(x){sum(str_detect(female_df$MH2, pattern = x))})
There are multiple ways to count how many rows contains each word or phrase. But based on the answers and discussions in this thread so far, the general strategy to achieve this.
lapply
and sapply
from base R, or map
function from the purrr
package.grep
, grepl
from base R, or str_detect
or str_which
from the stringr
package.Since the OP has a huge amount of data to process, I conducted an analysis to compare which combinations of functions from base R, purrr
, and stringr
can achieve the same task taking the least amount of time.
I investigated a total of eight combinations. There are choices between using sapply
or map_int
, grep
or str_which
, and grepl
or str_detect
.
Here I created two data frames, female
and female_df
, based on OP's example. Notice that I set stringsAsFactors
to make sure each entire column is in character format.
# Create the example data frame: female
female <- data.frame(Var2 = c("myocardial infarction drug therapy",
"imipramine poisoning",
"oximetry",
"thrombosis drug therapy",
"angioedema chemically induced"),
stringsAsFactors = FALSE)
# Create the example data frame: female_df
female_df <- data.frame(MH2 = c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
"angioedema chemically induced, angioedema chemically induced, oximetry",
"abo blood group system, imipramine poisoning, adverse effects",
"isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
"thrombosis drug therapy"),
stringsAsFactors = FALSE)
I also load the required packages. microbenchmark
is a package to evaluate code performance.
# Load packages
library(purrr)
library(stringr)
library(microbenchmark)
Here is a list of combination of functions that can achieve OP's task.
This is from Luís Telles's answer. It uses sapply
and grepl
.
sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
myocardial infarction drug therapy imipramine poisoning
1 1
oximetry thrombosis drug therapy
2 2
angioedema chemically induced
1
This is from Dave2e's answer. It uses sapply
and grep
.
sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})
myocardial infarction drug therapy imipramine poisoning
1 1
oximetry thrombosis drug therapy
2 2
angioedema chemically induced
1
This uses map_int
and str_detect
.
map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1
This uses map_int
and str_which
.
map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
[1] 1 1 2 2 1
This uses map_int
and grepl
.
map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})
[1] 1 1 2 2 1
This uses map_int
and grep
.
map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})
[1] 1 1 2 2 1
This uses sapply
and str_detect
.
sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})
myocardial infarction drug therapy imipramine poisoning
1 1
oximetry thrombosis drug therapy
2 2
angioedema chemically induced
1
This uses sapply
and str_which
.
sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})
myocardial infarction drug therapy imipramine poisoning
1 1
oximetry thrombosis drug therapy
2 2
angioedema chemically induced
1
All these combinations are valid answer. For example, we can do female$Count <
to store any of the results from these combinations.
Here I conducted benchmarking of these eight combinations with 30000 times sampling.
m <- microbenchmark(
C1 = {sapply(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
C2 = {sapply(female$Var2, function(x){length(grep(x, female_df$MH2))})},
C3 = {map_int(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
C4 = {map_int(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
C5 = {map_int(female$Var2, function(x){sum(grepl(pattern = x, female_df$MH2))})},
C6 = {map_int(female$Var2, function(x){length(grep(x, female_df$MH2))})},
C7 = {sapply(female$Var2, function(x){sum(str_detect(female_df$MH2, pattern = x))})},
C8 = {sapply(female$Var2, function(x){length(str_which(female_df$MH2, pattern = x))})},
times = 30000L
)
print(m)
Unit: microseconds
expr min lq mean median uq max neval
C1 166.144 200.784 1503.780 2192.261 2401.063 184228.81 30000
C2 163.578 198.860 1420.937 1460.653 2280.465 144553.22 30000
C3 189.238 231.575 1502.319 790.305 2386.309 146455.85 30000
C4 200.784 246.329 1461.714 1224.909 2306.125 184189.04 30000
C5 150.107 185.388 1452.586 1970.630 2376.687 32124.08 30000
C6 148.824 184.105 1398.312 1921.556 2259.937 145843.88 30000
C7 205.916 251.461 1516.979 851.246 2408.119 146305.10 30000
C8 215.538 264.932 1481.538 1508.764 2324.727 229709.16 30000
All these combinations have similar average time, but Combination 3, the use of map_int
and str_detect
, has the lowest median.
Upvotes: 4
Reputation: 24069
In your solution above the constant use of rbind
to add each row onto the data frame is very expensive in terms of processing time.
Here is a solution using the stringr package.
#Data set up
var2<-c("myocardial infarction drug therapy", "imipramine poisoning", "oximetry",
"thrombosis drug therapy", "angioedema chemically induced")
female<-data.frame(var2, stringsAsFactors = FALSE)
MH2<-c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
"angioedema chemically induced, angioedema chemically induced, oximetry",
"abo blood group system, imipramine poisoning, adverse effects",
"isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
"thrombosis drug therapy")
female_df<-data.frame(MH2, stringsAsFactors = FALSE)
library(stringr)
#create a matrix where columns is the terms
# and the rows are the lines checked.
termmatrix<-sapply(female$var2, function(x){str_count(female_df$MH2, x)})
#find the sums of the columns to determine the number of times each term is used
ans<-colSums(termmatrix)
The final ans
is a named vector with the terms and total counts.
Addition
To avoid creating a huge term matrix try:
ans<-sapply(female$var2, function(x){length(grep(x, female_df$MH2))})
A slight modification of Luis answer
Upvotes: 2
Reputation: 694
A solution with only base R (assuming your female$VAR2
only has unique strings):
counts <- sapply(female$VAR2, function(x){ z <- sum(grepl(pattern = x,
x = female_df$MH2,
ignore.case = TRUE))
z
})
word_counts <- cbind(female$VAR2, counts)
Upvotes: 2