Reputation: 1895
I have a dataset of tweets downloaded with rtweet, and I'd like to see how many times three different strings occur in the variable x$mentions_screen_name.
The key thing I'm trying to do is count how many times 'A' occurs, then 'B', then 'C'. My attempt at a reproducible example is as follows.
#These are the strings I would like to count
var<-c('A', 'B', 'C')
#The variable that contains the strings looks like this
library(stringi)
df<-data.frame(var1=stri_rand_strings(100, length=3, '[A-C]'))
#How do I count how many cases contain A, then B and then C.?
library(purrr)
df%>%
map(var, grepl(., df$var1))
Upvotes: 1
Views: 1645
Reputation: 109954
I think you may want something different than what others have posted. I may be wrong but the phrase you used:
'A' occurs, then 'B', then 'C'
indicates to me that you want to check whether things happen in a particular order.
If this is the case, may I suggest making your question more explicit. You provide an MWE, but it could be made more minimal without the need for stringi (which I love as a package), because I doubt your tweets look anything like "ACB" in reality. Hand-making 3-5 strings could accomplish this without loading another package. Also, showing your desired output makes the problem more explicit with less need for explanation.
library(dplyr)  # load before building the data; tibble() replaces the deprecated data_frame()
df <- tibble(var1=c(
"I think A is good But then C.",
"'A' occurs, then 'B', then 'C'",
"and a then lower with b that c will fail",
NA,
"what about A, B, C and another ABC",
"CBA?",
"last null"
))
var <- c('A', 'B', 'C')
library(stringi); library(dplyr)
df%>%
mutate(
count_abc = stringi::stri_count_regex(
var1,
paste(var, collapse = '.*?')
),
indicator = count_abc > 0
)
##                                       var1 count_abc indicator
## 1            I think A is good But then C.         1      TRUE
## 2           'A' occurs, then 'B', then 'C'         1      TRUE
## 3 and a then lower with b that c will fail         0     FALSE
## 4                                     <NA>        NA        NA
## 5       what about A, B, C and another ABC         2      TRUE
## 6                                     CBA?         0     FALSE
## 7                                last null         0     FALSE
## or if you only care about the summary compute it directly
df%>%
summarize(
count_abc = sum(stringi::stri_detect_regex(
var1,
paste(var, collapse = '.*?')
), na.rm = TRUE)
)
##   count_abc
## 1         3
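To make the pattern less magical, here is a minimal base-R sketch of what paste(var, collapse = '.*?') builds and how it behaves. The example strings are made up for illustration, and base functions (grepl, gregexpr, regmatches) are swapped in for the stringi calls above, so treat it as a rough equivalent rather than the same code path.

```r
# Base-R sketch; the example strings are made up for illustration.
var <- c("A", "B", "C")

# Collapse the targets into one lazy "A, later B, later C" pattern.
pattern <- paste(var, collapse = ".*?")   # "A.*?B.*?C"

x <- c("A then B then C", "CBA?", "ABC and ABC")

# Does each string contain the ordered sequence at least once?
ordered_hit <- grepl(pattern, x, perl = TRUE)

# How many non-overlapping ordered sequences does each string contain?
n_matches <- lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))

# Unlike the stringi functions, grepl() reports FALSE (not NA) for missing input.
na_hit <- grepl(pattern, NA_character_, perl = TRUE)

print(data.frame(x, ordered_hit, n_matches))
```

One difference worth keeping in mind: with grepl() an NA row silently counts as "no match", whereas the stringi version keeps it as NA, which is why the summary above needs na.rm = TRUE.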
If I'm wrong, my apologies for the misunderstanding.
Upvotes: 1
Reputation: 20095
Another option, using stringr and sapply, could be:
library(stringr)
library(stringi)  # needed for stri_rand_strings()
set.seed(1)
df<-data.frame(var1=stri_rand_strings(100, length=3, '[A-C]'))
var<-c('A', 'B', 'C')
colSums(sapply(var, function(x,y)str_count(y, x), df$var1 ))
#   A   B   C
# 101 109  90
Upvotes: 0
Reputation: 5893
If you want to count ALL occurrences (including multiple occurrences within a single string), you can use str_count from the stringr package.
map_int(var, ~sum(stringr::str_count(df$var1, .)))
[1] 90 112 98
Otherwise, to count the strings that contain each pattern at least once, you can use str_detect.
map_int(var, ~sum(stringr::str_detect(df$var1, .)))
[1] 66 71 70
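The difference between the two counts is easier to see on a few hand-made strings. This is only a sketch: the three strings are invented, and base R (gregexpr/regmatches for counting, grepl for detecting) stands in for str_count and str_detect, with count_fixed being a hypothetical helper name.

```r
# Hand-made strings (illustrative only) instead of random ones.
x <- c("AAB", "CCC", "ABC")
var <- c("A", "B", "C")

# Hypothetical helper: number of literal occurrences of `p` in each string.
count_fixed <- function(p, x) {
  lengths(regmatches(x, gregexpr(p, x, fixed = TRUE)))
}

# Total occurrences across all strings (the str_count-style answer).
total_occurrences <- sapply(var, function(p) sum(count_fixed(p, x)))

# Number of strings containing the target at least once (the str_detect-style answer).
strings_containing <- sapply(var, function(p) sum(grepl(p, x, fixed = TRUE)))

print(rbind(total_occurrences, strings_containing))
```

Here "CCC" contributes three to the total for C but only one to the number of strings containing C, which is the gap between the two results.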
Upvotes: 1
Reputation: 99351
You can do this easily by summing the columns after running grepl() through sapply().
colSums(sapply(var, grepl, df$var1))
# A B C
# 72 72 69
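One caveat: grepl() matches substrings, so a target like "A" would also hit a screen name such as "AB". If exact screen names are what you are after, a sketch along these lines may be closer. It assumes x$mentions_screen_name is a list column holding one character vector per tweet (which I believe is how older rtweet versions return it, but please check your data); the mentions object below is a made-up stand-in, not real rtweet output.

```r
# Hypothetical stand-in for x$mentions_screen_name, assumed to be a list column
# with one character vector of mentioned screen names per tweet.
mentions <- list(c("A", "B"), NA_character_, "C", c("A", "AB"), "A")
var <- c("A", "B", "C")

# Flatten to one long vector of mentions.
all_mentions <- unlist(mentions)

# Restricting the factor levels to `var` drops everything else (including NA
# and near-misses like "AB"), so table() counts exact matches only.
counts <- table(factor(all_mentions, levels = var))

print(counts)
```

Using factor levels also guarantees a zero count for any target that never appears, instead of it silently disappearing from the table.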
Upvotes: 1