Summarise based on categorical runs

Question

I'm working on summarizing a large set of sensor data. I need to extract (1) the length of each consecutive run of a category and (2) summary statistics for all variables within each run.

For example data:

library(dplyr)
fruit <- as.factor(c('apple','apple','banana','banana','banana','guava','guava','guava','guava','apple','apple','apple','banana','guava'))
duration <- c(1,2,1,2,3,1,2,3,4,1,2,3,1,1)
set.seed(14)
temp <- round(runif(14, 80.0, 105.0))
test <- data.frame(duration, fruit, temp)

#Example Data Frame
duration  fruit   temp
 1        apple   86
 2        apple   96
 1        banana  104
 2        banana  94
 3        banana  105
 1        guava   93
 2        guava   103
 3        guava   91
 4        guava   92
 1        apple   90
 2        apple   102
 3        apple   84
 1        banana  92
 1        guava   101

I can accomplish #1 by comparing each row with the next one and keeping the rows where the fruit changes (or where there is no next row):

test %>% filter(lead(fruit) != fruit | is.na(lead(fruit)))

However, this returns only the single temp value from the final row of each run, and I'd like to be able to calculate various summaries, such as the mean, over all temp values in the run.

What I'd like to end up with is a data frame more like this (mean_temp is filled in by hand here):

test %>%
  filter((lead(`fruit`) != `fruit`)| is.na(lead(`fruit`)) ) %>%
  select(-temp) %>%
  mutate(mean_temp = c(91,101,94.8,92,92,101))

##Goal Output
duration  fruit      mean_temp
2         apple      91.0
3         banana     101.0
4         guava      94.8
3         apple      92.0
1         banana     92.0
1         guava      101.0

Any ideas for how to do this efficiently?

Ronak Shah · Accepted Answer

We can create groups using lag and cumsum and then calculate statistics for each group.

library(dplyr)

test %>%
  group_by(group = cumsum(fruit != lag(fruit, default = first(fruit)))) %>%
  summarise(fruit = first(fruit), 
            duration = n(), 
            mean_temp = mean(temp)) %>%
  select(-group)

#  fruit  duration mean_temp
#  <fct>     <int>     <dbl>
#1 apple         2      91  
#2 banana        3     101  
#3 guava         4      94.8
#4 apple         3      92  
#5 banana        1      92  
#6 guava         1     101
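
To see why this grouping works, it can help to look at the intermediate vector on its own. The sketch below rebuilds only the fruit column from the question; the names changed and run_id are illustrative and not part of the original answer.

```r
library(dplyr)

fruit <- factor(c('apple','apple','banana','banana','banana',
                  'guava','guava','guava','guava',
                  'apple','apple','apple','banana','guava'))

# TRUE wherever a value differs from the one before it.
# default = first(fruit) makes the very first comparison FALSE instead of NA.
changed <- fruit != lag(fruit, default = first(fruit))

# Each TRUE bumps the running total, so every consecutive run gets its own id
run_id <- cumsum(changed)
run_id
#> [1] 0 0 1 1 1 2 2 2 2 3 3 3 4 5
```

The second apple run gets id 3 rather than 0, which is what keeps it separate from the first apple run; grouping by fruit alone would merge the two.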

The groups can also be created with data.table::rleid by replacing the group_by line with

group_by(group = data.table::rleid(fruit))

Or using rle

group_by(group = with(rle(as.character(fruit)), rep(seq_along(values), lengths)))
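
As a rough illustration of what the rle expression produces, here is a base R sketch with no packages. The temp values are copied from the example table in the question rather than regenerated with runif, and the names runs, run_id and mean_by_run are assumptions made for this example.

```r
fruit <- c('apple','apple','banana','banana','banana',
           'guava','guava','guava','guava',
           'apple','apple','apple','banana','guava')
temp  <- c(86, 96, 104, 94, 105, 93, 103, 91, 92, 90, 102, 84, 92, 101)

# rle() needs an atomic vector, which is why the answer wraps the
# factor column in as.character(); here fruit is already character
runs <- rle(fruit)
runs$values   # one entry per run: apple banana guava apple banana guava
runs$lengths  # run lengths: 2 3 4 3 1 1

# Repeat each run's index by its length to get one id per row
run_id <- rep(seq_along(runs$values), runs$lengths)

# Mean temp per run, using the ids as the grouping key
mean_by_run <- tapply(temp, run_id, mean)
mean_by_run
```

The ids start at 1 here instead of 0 as in the cumsum version, but only the boundaries between groups matter for the summary.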

Or using data.table

library(data.table)
setDT(test)[, .(duration = .N, fruit = fruit[1L], 
                mean_temp = mean(temp)), by = rleid(fruit)]
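
Since the question asks for several summaries rather than just the mean, the dplyr version can be extended with more columns in summarise. This is a sketch under the same grouping as above; the extra column names (min_temp, max_temp, sd_temp) and the result name out are made up for illustration, and the temp values are copied from the question's example table.

```r
library(dplyr)

test <- data.frame(
  fruit = factor(c('apple','apple','banana','banana','banana',
                   'guava','guava','guava','guava',
                   'apple','apple','apple','banana','guava')),
  temp  = c(86, 96, 104, 94, 105, 93, 103, 91, 92, 90, 102, 84, 92, 101)
)

out <- test %>%
  # Same run ids as in the answer: increment whenever fruit changes
  group_by(group = cumsum(fruit != lag(fruit, default = first(fruit)))) %>%
  summarise(fruit     = first(fruit),
            duration  = n(),
            mean_temp = mean(temp),
            min_temp  = min(temp),
            max_temp  = max(temp),
            sd_temp   = sd(temp)) %>%  # NA for runs with a single row
  select(-group)

out
```

Runs of length one (the last banana and guava rows) have no defined standard deviation, so sd_temp is NA for them.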
