onlyjust17

Reputation: 135

Count string by group (R)

Grouping by year, I would like to count the number of times a string appears across multiple variables (columns).

year <- c("1993", "1994", "1995")
var1 <- c("tardigrades are usually about 0.5 mm long when fully grown.", "slow steppers", "easy") 
var2 <- c("something", "polar bear", "tardigrades are prevalent in mosses and lichens and feed on plant cells")
var3 <- c("kleiner wasserbaer", "newly hatched", "happy learning")
tardigrades <- data.frame(year, var1, var2, var3)
      
      
count_year <- tardigrades %>%
group_by(year) %>%
summarize(count = sum(str_count(tardigrades, 'tardigrades')))

Unfortunately, the total sum is added to each year with this solution. What am I doing wrong?

Upvotes: 1

Views: 704

Answers (1)

r2evans

Reputation: 160407

You should (almost) never reference the original frame (tardigrades) inside a dplyr pipe: doing so bypasses the grouping and hands the whole frame to the function, which is why every year gets the same total. If you want to operate on most or all columns, use an aggregating or iterating function and be explicit about the columns (e.g., everything() in tidyselect-speak).
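To make that concrete, here is a minimal sketch using a hypothetical toy frame `df` (not from the question; the column names `g` and `x` are made up for illustration). It assumes dplyr is installed. Referring to `df$x` inside `summarize()` reaches back to the full, ungrouped frame, while the bare column name `x` is evaluated per group:

```r
library(dplyr)

# Hypothetical toy data, only for illustration
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 4))

result <- df %>%
  group_by(g) %>%
  summarize(
    wrong = sum(df$x),  # uses the whole original frame: same total for every group
    right = sum(x)      # uses only the rows of the current group
  )

result$wrong  # 7 7
result$right  # 3 4
```

The `wrong` column repeats the grand total for each group, which is the same symptom as in the question.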

Two suggestions for how to approach this:

  1. Pivot and summarize:

    library(dplyr)
    library(tidyr) # pivot_longer
    pivot_longer(tardigrades, -year) %>%
      group_by(year) %>%
      summarize(count = sum(grepl("tardigrades", value)))
    # # A tibble: 3 x 2
    #   year  count
    #   <chr> <int>
    # 1 1993      1
    # 2 1994      0
    # 3 1995      1
    
  2. Sum across the columns; this must be done rowwise (and not by year):

    tardigrades %>%
      rowwise() %>%
      summarize(
        year,
        count = sum(grepl("tardigrades", c_across(-year))),
        .groups = "drop")
    # # A tibble: 3 x 2
    #   year  count
    #   <chr> <int>
    # 1 1993      1
    # 2 1994      0
    # 3 1995      1
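
One caveat: `grepl()` reports whether a cell contains the string, so a cell that mentions it twice still counts as 1. If the goal is the number of occurrences, the following base-R sketch may help. The helper name `count_matches` is hypothetical (not part of any package), and the pattern is assumed to be a fixed string rather than a regular expression:

```r
year <- c("1993", "1994", "1995")
var1 <- c("tardigrades are usually about 0.5 mm long when fully grown.", "slow steppers", "easy")
var2 <- c("something", "polar bear", "tardigrades are prevalent in mosses and lichens and feed on plant cells")
var3 <- c("kleiner wasserbaer", "newly hatched", "happy learning")
tardigrades <- data.frame(year, var1, var2, var3)

# Hypothetical helper: number of occurrences of `pattern` in each element of `x`
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

# Every column except `year` is treated as text to search
text_cols <- setdiff(names(tardigrades), "year")

# Count per cell, then add the columns together to get one count per row
per_row <- Reduce(`+`, lapply(tardigrades[text_cols], count_matches,
                              pattern = "tardigrades"))

# Sum the row counts within each year
count_year <- aggregate(count ~ year,
                        data = data.frame(year = tardigrades$year, count = per_row),
                        FUN = sum)
count_year
#   year count
# 1 1993     1
# 2 1994     0
# 3 1995     1
```

For this sample data the result matches the `grepl()` approaches, since no cell contains the string more than once; the two would differ only when a single cell has repeated matches.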
    

Upvotes: 2
