pd441

Reputation: 2763

Count the number of occurrences of multiple letters in a variable within a dataframe?

Say I want to count the number of "a"s and "p"s in the word "apple". I can do:

library(stringr)
sum(str_count("apple", c("a", "p")))

But when I apply the same logic to count the "a"s and "p"s in each of several words held in a data frame variable, I get the wrong result, e.g.:

library(dplyr)
dat <- tibble(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))
dat <- dat %>% mutate(num_ap = sum(str_count(word, c("a", "p"))))

The variable "num_ap" should read c(3, 3, 2, 4), but instead it reads c(5, 5, 5, 5).

Does anyone know why this isn't working for me?

Thanks!

Upvotes: 2

Views: 419

Answers (4)

akrun

Reputation: 887088

Using base R

dat$num_ap <- nchar(gsub("[^ap]", "", dat$word))

-output

> dat
  id      word num_ap
1  1     apple      3
2  2    banana      3
3  3      pear      2
4  4 pineapple      4

data

dat <- structure(list(id = 1:4, word = c("apple", "banana", "pear", 
"pineapple")), class = "data.frame", row.names = c(NA, -4L))
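To make the intermediate step visible, here is a small sketch of what the gsub() call does before nchar() is applied: it deletes every character that is not "a" or "p", so the length of what remains is the count. The names words and kept are illustrative only and do not appear in the answer.

```r
# Same four words as in the question
words <- c("apple", "banana", "pear", "pineapple")

# "[^ap]" matches any single character other than "a" or "p";
# replacing those matches with "" leaves only the letters of interest
kept <- gsub("[^ap]", "", words)
kept
#> [1] "app"  "aaa"  "pa"   "papp"

# The number of characters left is the number of "a"s and "p"s
nchar(kept)
#> [1] 3 3 2 4
```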

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

Two solutions (both without sum):

with rowwise():

library(dplyr)
library(stringr)
dat %>%
  rowwise() %>%
  mutate(num_ap = str_count(word, "a|p"))
  id      word num_ap
1  1     apple      3
2  2    banana      3
3  3      pear      2
4  4 pineapple      4

with lengths and str_extract_all:

library(dplyr)
library(stringr)
dat %>%
  mutate(num_ap = lengths(str_extract_all(word, "a|p")))
  id      word num_ap
1  1     apple      3
2  2    banana      3
3  3      pear      2
4  4 pineapple      4
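As a side note, str_count() is vectorised over its first argument, so a single pattern already yields one count per word. The sketch below builds a character-class pattern from a vector of letters; targets and pattern are hypothetical names, and it assumes the targets are plain letters with no special meaning inside square brackets.

```r
library(stringr)

words <- c("apple", "banana", "pear", "pineapple")

# Hypothetical vector of letters to count (assumed to be plain letters)
targets <- c("a", "p")

# Collapse the letters into one regex character class
pattern <- paste0("[", paste(targets, collapse = ""), "]")
pattern
#> [1] "[ap]"

# One pattern, many strings: one count per word
counts <- str_count(words, pattern)
counts
#> [1] 3 3 2 4
```

Inside mutate() this would read mutate(num_ap = str_count(word, "[ap]")), which behaves like the "a|p" pattern used above.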

Upvotes: 2

Fabian B

Reputation: 171

In cases like this it helps to backtrack the issue.

str_count(dat$word, c("a", "p")) by itself returns [1] 1 0 1 3. Because the pattern vector is shorter than the word vector, it is recycled, so each word is checked against only one letter: "a" in "apple", "p" in "banana", "a" in "pear", and "p" in "pineapple". Taking the sum of that vector with sum(str_count(dat$word, c("a", "p"))) gives [1] 5. Since mutate() is not working row by row here, that single value is assigned to every row, which is consistent with your results.
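To see where 1 0 1 3 comes from, the following sketch spells out how str_count() pairs words with patterns: it is vectorised over both arguments, so the two-letter pattern vector is recycled along the four words. The names words, paired, and explicit are illustrative only.

```r
library(stringr)

words <- c("apple", "banana", "pear", "pineapple")

# The length-2 pattern vector is recycled to c("a", "p", "a", "p"),
# so each word is matched against exactly one letter
paired <- str_count(words, c("a", "p"))
paired
#> [1] 1 0 1 3

# The same call with the pairing written out explicitly
explicit <- str_count(words, c("a", "p", "a", "p"))
identical(paired, explicit)
#> [1] TRUE

# sum() collapses the whole vector to a single number,
# which mutate() then repeats on every row
sum(paired)
#> [1] 5
```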

To fix this, note that rowwise() (part of the dplyr package) lets you operate on each row individually. Modifying your code to include rowwise() solves the problem:

dat <- dat %>% rowwise() %>% mutate(num_ap = sum(str_count(word, c("a", "p"))))
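One thing to keep in mind is that rowwise() leaves the result grouped by row, so later steps would also run row by row. A minimal sketch, assuming the same data as in the question, that adds ungroup() at the end (res is an illustrative name):

```r
library(dplyr)
library(stringr)

dat <- tibble(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))

res <- dat %>%
  rowwise() %>%
  # Within each row, word is a single string, so both letters are counted
  mutate(num_ap = sum(str_count(word, c("a", "p")))) %>%
  # Drop the row-wise grouping so later verbs see whole columns again
  ungroup()

res$num_ap
#> [1] 3 3 2 4
```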

Upvotes: 3

Jan

Reputation: 5254

Use sapply to apply the transformation to each element of dat$word:

library(stringr)
dat <- data.frame(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))
dat$num_ap <- sapply(dat$word, function(x) sum(str_count(x, c("a", "p"))))

dat
#>   id      word num_ap
#> 1  1     apple      3
#> 2  2    banana      3
#> 3  3      pear      2
#> 4  4 pineapple      4

Created on 2021-10-14 by the reprex package (v2.0.1)
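A possible variation, offered as a sketch rather than a correction: sapply() on a character vector returns a vector named after its input, and vapply() can be used instead to state the expected result type and drop those names. The helper name count_ap is hypothetical.

```r
library(stringr)

dat <- data.frame(id = 1:4, word = c("apple", "banana", "pear", "pineapple"))

# Hypothetical helper: total number of "a"s and "p"s in one string
count_ap <- function(x) sum(str_count(x, c("a", "p")))

# integer(1) declares that each call must return a single integer;
# USE.NAMES = FALSE keeps the words from becoming element names
dat$num_ap <- vapply(dat$word, count_ap, integer(1), USE.NAMES = FALSE)

dat$num_ap
#> [1] 3 3 2 4
```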

Upvotes: 2
