Maximilian Christian

Reputation: 43

str_count: Problem with NAs and multiple occurrences of similar words

I want to count the number of occurrences of certain strings in a column using the function str_count. It works fine when every value is a valid code. However, the result is NA for any group that contains even one NA, and my column contains lots of NAs.

I've tried unsuccessfully to carry out this task with the summarize function of the tidyverse, using the sum function with the %in% operator as well as regular comparisons. Combining sum and str_count has so far given me the most promising results.

library(tidyverse)

# Reproducible data frame similar to the one I am working on
# This should resemble long data for two participants, that each have two 
# codes in a column
test <- data.frame(name = c("A1", "A1", "B1", "B1"), code_2 = c("SF08", "SF03", "SF03", NA))

# Here is my analysis that counts the number of matches of a code
analysis <- test %>% 
  group_by(name) %>% 
  summarize(
       total_sf2 = sum(stringr::str_count(code_2, "SF"))
       )

I would expect two matches for participant A1 (which I get) and one match for participant B1, instead of the NA I currently get.

Upvotes: 4

Views: 591

Answers (3)

akrun

Reputation: 886938

An option using grepl and data.table

library(data.table)
setDT(test)[, .(code_2 = sum(grepl("SF", code_2))), name]
#   name code_2
#1:   A1      2
#2:   B1      1
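
Why this sidesteps the NA problem can be sketched in base R: grepl returns FALSE (not NA) for missing values, so sum never sees an NA. Note that it counts elements with at least one match, not the number of occurrences, which only matters if a single cell can hold several codes. The vector below is made up for illustration and is not part of the original data.

```r
# Hypothetical values: two single codes, one NA, one cell with two codes
codes <- c("SF08", "SF03", NA, "SF01 SF02")

# grepl() yields FALSE for NA, so there is nothing for sum() to propagate
hits <- grepl("SF", codes)
hits        # TRUE TRUE FALSE TRUE

# Counts matching elements: the last cell contributes 1, not 2
sum(hits)   # 3
```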

Upvotes: 0

jay.sf

Reputation: 72593

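In base R you could use regexpr inside aggregate. The formula interface of aggregate drops rows containing NA by default (na.action = na.omit), so the NAs don't affect the result.

# regexpr returns the match position (-1 if none), so test for > 0 before summing
aggregate(code_2 ~ name, test, function(x) sum(regexpr("SF", x) > 0))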
#   name code_2
# 1   A1      2
# 2   B1      1
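
One caveat worth illustrating: regexpr returns the position of the first match, or -1 when there is none, so summing the raw positions equals the number of matches only when every match starts at position 1, as it does in the sample data. A small sketch with a made-up vector (not from the question) shows the difference:

```r
# Hypothetical values: match at position 1, match at position 3, no match
x <- c("SF03", "xxSF03", "none")

# Strip the attributes regexpr() attaches and keep the positions only
pos <- as.integer(regexpr("SF", x))
pos           # 1 3 -1

sum(pos)      # 3: a sum of positions, not a count
sum(pos > 0)  # 2: the number of elements that contain "SF"
```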

Upvotes: 0

LyzandeR

Reputation: 37879

Just add na.rm = TRUE in your sum call:

test %>% 
   group_by(name) %>% 
   summarize(
     total_sf2 = sum(stringr::str_count(code_2, "SF"), na.rm=TRUE)
   )

# A tibble: 2 x 2
#  name  total_sf2
#  <fct>     <int>
#1 A1            2
#2 B1            1
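
The reason this works can be seen on a bare vector: str_count returns NA for a missing input, and sum propagates any NA unless told to drop it. A minimal sketch, assuming stringr is installed and using a two-element vector that mirrors participant B1:

```r
library(stringr)

# Mirrors the B1 group from the question: one code and one NA
codes <- c("SF03", NA)

counts <- str_count(codes, "SF")
counts                     # 1 NA

sum(counts)                # NA: a single NA poisons the total
sum(counts, na.rm = TRUE)  # 1: NAs are dropped before summing
```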

Upvotes: 1
