Reputation: 43
I want to count the number of occurrences of certain strings in a column using the function str_count. It works fine for lines in which only the correct expression is included. However, I get the result NA for lines which include an NA, and my column contains lots of NAs.
I've tried unsuccessfully to carry out this task with the summarize function of the tidyverse, using the sum function and the %in% operator as well as regular comparisons. Combining sum and str_count has given me the most promising results so far.
library(tidyverse)
# Reproducible data frame similar to the one I am working on
# This should resemble long data for two participants, that each have two
# codes in a column
test <- data.frame(name = c("A1", "A1", "B1", "B1"), code_2 = c("SF08", "SF03", "SF03", NA))
# Here is my analysis that counts the number of matches of a code
analysis <- test %>%
  group_by(name) %>%
  summarize(
    total_sf2 = sum(stringr::str_count(code_2, "SF"))
  )
I would expect two matches for participant A1 (which I get), and one match for participant B1 instead of the NA I currently get.
Upvotes: 4
Views: 591
Reputation: 886938
An option using grepl and data.table:
library(data.table)
setDT(test)[, .(code_2 = sum(grepl("SF", code_2))), name]
#    name code_2
# 1:   A1      2
# 2:   B1      1
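No na.rm is needed in this approach because grepl returns FALSE rather than NA for a missing value, so no NA ever reaches sum. Here is a minimal base R sketch of that behaviour; the vector codes is made up for illustration and mirrors participant B1. Note that this counts rows containing "SF", not the number of occurrences within each string, which gives the same result only when each code contains the pattern at most once.

```r
# Hypothetical vector mirroring participant B1's codes (not from the original data frame)
codes <- c("SF03", NA)

# grepl() returns FALSE for NA input, so the result contains no NA
matches <- grepl("SF", codes)
print(matches)

# Summing the logical vector counts the rows that contain "SF"
n_rows <- sum(matches)
print(n_rows)
```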
Upvotes: 0
Reputation: 72593
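In base R you could use regexpr in aggregate. The formula interface of aggregate drops rows containing NA by default (na.action = na.omit), so the result isn't affected by NAs.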
# regexpr returns the match position (-1 if there is no match), so count the positions > 0
aggregate(code_2 ~ name, test, function(x) sum(regexpr("SF", x) > 0))
# name code_2
# 1 A1 2
# 2 B1 1
Upvotes: 0
Reputation: 37879
Just add na.rm = TRUE to your sum call:
test %>%
  group_by(name) %>%
  summarize(
    total_sf2 = sum(stringr::str_count(code_2, "SF"), na.rm = TRUE)
  )
# A tibble: 2 x 2
# name total_sf2
# <fct> <int>
#1 A1 2
#2 B1 1
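The reason this works: str_count returns NA for a missing string, and sum propagates any NA unless na.rm = TRUE is set. Here is a small sketch of that behaviour outside the pipeline; the vector codes is made up for illustration and mirrors participant B1.

```r
library(stringr)

# Hypothetical vector mirroring participant B1's codes (not from the original data frame)
codes <- c("SF03", NA)

# str_count() gives one count per element; the missing value stays NA
counts <- str_count(codes, "SF")
print(counts)

# By default a single NA makes the whole sum NA
total_default <- sum(counts)
print(total_default)

# With na.rm = TRUE the NA is dropped before summing
total_na_rm <- sum(counts, na.rm = TRUE)
print(total_na_rm)
```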
Upvotes: 1