Kelsey Urgo

Reputation: 21

Count string followed by separate string occurrences in R

I am trying to count the number of occurrences of a string followed by another string in R. I cannot seem to work out a regex that counts this correctly.

As an example:

v <- c("F", "F", "C", "F", "F", "C", "F", "F")
b <- str_count(v, "F(?=C)")

I would like b to tell me how many times string F was followed by string C in vector v (which should equal 2).

I have successfully used str_count() to count single strings, but I cannot figure out how to count a string followed by a different string.

Also, I found that in regex (?=...) should indicate "followed by"; however, this does not seem to be sufficient.

Upvotes: 2

Views: 200

Answers (2)

GKi

Reputation: 39657

You don't have one string; you have a vector of individual strings. Here you can test whether F is followed by C by comparing the vector with a shifted copy of itself, using [ for subsetting.

sum(v[-length(v)] == "F" & v[-1] == "C")
#sum(v == "F" & c(v[-1] == "C", FALSE)) #Alternative
#[1] 2
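To make the shifting visible, here is a small sketch that spells out the two aligned vectors. The names current, nxt and hits are only illustrative and are not part of the answer above.

```r
v <- c("F", "F", "C", "F", "F", "C", "F", "F")

current <- v[-length(v)]  # elements 1 .. n-1
nxt     <- v[-1]          # elements 2 .. n, i.e. each element's successor

# TRUE wherever an "F" is immediately followed by a "C"
hits <- current == "F" & nxt == "C"

which(hits)  # positions in v where such an "F" sits
#> [1] 2 5
sum(hits)
#> [1] 2
```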

To use stringr::str_count, first paste v into a single string.

stringr::str_count(paste(v, collapse = ""), "F(?=C)")
#[1] 2
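As a sketch of why the original call did not give 2: str_count() is vectorised, so it returns one count per element of v, and a one-character element can never contain an F followed by a C. The variable names below are only illustrative.

```r
library(stringr)

v <- c("F", "F", "C", "F", "F", "C", "F", "F")

# One count per element; no single element contains "F" followed by "C".
per_element <- str_count(v, "F(?=C)")
sum(per_element)
#> [1] 0

# After collapsing, neighbouring elements sit next to each other in one string.
collapsed <- paste(v, collapse = "")
collapsed
#> [1] "FFCFFCFF"
str_count(collapsed, "F(?=C)")
#> [1] 2
```

Note that collapsing is only safe here because every element is a single character. With longer elements, a match could straddle the boundary between two elements.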

And for rows of a data.frame:

set.seed(42)
v <- as.data.frame(matrix(sample(c("F", "C"), 25, TRUE), 5))
stringr::str_count(apply(v, 1, paste, collapse = ""), "F(?=C)")
#[1] 1 1 2 1 1

Upvotes: 2

David Robinson

Reputation: 78600

You can use lag() from dplyr:

library(dplyr)
sum(v == "C" & lag(v) == "F", na.rm = TRUE)

(The na.rm = TRUE is because the first value of lag(v) is NA).
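A small sketch of when na.rm = TRUE actually changes the result. The vector v2 is a made-up example that starts with "C", so the comparison against the leading NA of lag(v2) yields NA rather than FALSE.

```r
library(dplyr)

v2 <- c("C", "F", "C")  # hypothetical vector whose first element is "C"

# lag() shifts everything one step to the right and pads with NA.
shifted <- lag(v2)
is.na(shifted[1])
#> [1] TRUE

# TRUE & NA is NA, so the sum is NA unless NAs are dropped.
without_na_rm <- sum(v2 == "C" & shifted == "F")
with_na_rm    <- sum(v2 == "C" & shifted == "F", na.rm = TRUE)

without_na_rm
#> [1] NA
with_na_rm
#> [1] 1
```

For the v in the question, the first element is "F", so the first comparison is FALSE & NA, which is FALSE; keeping na.rm = TRUE simply makes the code safe for either case.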


Your comment notes that you're also interested in applying this across each row of a data frame. This can be done by pivoting the data to be longer, then applying a grouped mutate, then pivoting the data to be wider again. On an example dataset:

example <- tibble(id = 1:3,
                  s1 = c("F", "F", "F"),
                  s2 = c("C", "F", "C"),
                  s3 = c("C", "C", "F"),
                  s4 = c("F", "C", "C"))

library(tidyr)  # provides pivot_longer() and pivot_wider()

example %>%
  pivot_longer(s1:s4) %>%
  group_by(id) %>%
  mutate(fc_count = sum(value == "C" & lag(value) == "F", na.rm = TRUE)) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)

Result:

# A tibble: 3 x 6
     id fc_count s1    s2    s3    s4   
  <int>    <int> <chr> <chr> <chr> <chr>
1     1        1 F     C     C     F    
2     2        1 F     F     C     C    
3     3        2 F     C     F     C    

Note that this assumes the data has something like an id column that uniquely identifies each original row. If it doesn't, you can add one with mutate(id = row_number()) first.
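A minimal sketch of that last point, using a made-up tibble no_id that lacks an id column. As a variation on the pipeline above, it uses summarise() instead of mutate() plus pivot_wider(), so it returns only one count per row.

```r
library(dplyr)
library(tidyr)

# Hypothetical data without an identifying column
no_id <- tibble(s1 = c("F", "F"),
                s2 = c("C", "F"),
                s3 = c("C", "C"))

counts <- no_id %>%
  mutate(id = row_number()) %>%   # add a unique id per original row
  pivot_longer(-id) %>%           # one row per (id, column) pair
  group_by(id) %>%
  summarise(fc_count = sum(value == "C" & lag(value) == "F", na.rm = TRUE))

counts$fc_count
#> [1] 1 1
```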

Upvotes: 1
