Reputation: 53
I'm working on summarizing a large set of sensor data. I need to extract 1) the maximum run length of a particular category and 2) summary statistics for all variables within each run.
Example data:
require(dplyr)
fruit <- as.factor(c('apple','apple','banana','banana','banana','guava','guava','guava','guava','apple','apple','apple','banana','guava'))
duration <- c(1,2,1,2,3,1,2,3,4,1,2,3,1,1)
set.seed(14)
temp <- round(runif(14, 80.0, 105.0))
test <- data.frame(duration, fruit, temp)
#Example Data Frame
duration fruit temp
1 apple 86
2 apple 96
1 banana 104
2 banana 94
3 banana 105
1 guava 93
2 guava 103
3 guava 91
4 guava 92
1 apple 90
2 apple 102
3 apple 84
1 banana 92
1 guava 101
I can accomplish #1 by comparing each row with the next row and keeping the rows where fruit changes.
However, this keeps only the single temp value from the final row of each run, and I'd like to be able to calculate various summaries, such as the mean, over all temp values in the run.
test %>% filter((lead(`fruit`) != `fruit`)| is.na(lead(`fruit`)) )
I'd like to end up with a data frame more like this (mean_temp is filled in by hand here):
test %>%
filter((lead(`fruit`) != `fruit`)| is.na(lead(`fruit`)) ) %>%
select(-temp) %>%
mutate(mean_temp = c(91,101,94.8,92,92,101))
##Goal Output
duration fruit mean_temp
2 apple 91.0
3 banana 101.0
4 guava 94.8
3 apple 92.0
1 banana 92.0
1 guava 101.0
Any ideas for how to do this efficiently?
Upvotes: 0
Views: 51
Reputation: 388982
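We can create groups using lag and cumsum and then calculate statistics for each group. A new group starts whenever fruit differs from the value in the previous row, so each consecutive run gets its own group id.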
library(dplyr)
test %>%
group_by(group = cumsum(fruit != lag(fruit, default = first(fruit)))) %>%
summarise(fruit = first(fruit),
duration = n(),
mean_temp = mean(temp)) %>%
select(-group)
# fruit duration mean_temp
# <fct> <int> <dbl>
#1 apple 2 91
#2 banana 3 101
#3 guava 4 94.8
#4 apple 3 92
#5 banana 1 92
#6 guava 1 101
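To see why the lag/cumsum expression picks out runs, here is a minimal base R sketch of the same idea. It uses the fruit vector from the question as a plain character vector; the names changed, group, runs and group_rle are only illustrative and not part of the answer's code.

```r
# Same fruit values as in the question (character instead of factor for simplicity).
fruit <- c('apple','apple','banana','banana','banana','guava','guava',
           'guava','guava','apple','apple','apple','banana','guava')

# TRUE wherever a value differs from the one before it. The first element is
# set to FALSE, mirroring lag(fruit, default = first(fruit)).
changed <- c(FALSE, fruit[-1] != fruit[-length(fruit)])

# Running count of changes: constant within a run, increases by 1 at each new run.
group <- cumsum(changed)
print(group)

# rle() describes the same runs; ids built from it start at 1 instead of 0.
runs <- rle(fruit)
group_rle <- rep(seq_along(runs$values), runs$lengths)
print(runs$lengths)
```

The group ids only need to be distinct per run, so it does not matter whether they start at 0 or 1.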
The groups can also be created using data.table::rleid, replacing the group_by line with:
group_by(group = data.table::rleid(fruit))
Or using rle from base R:
group_by(group = with(rle(as.character(fruit)), rep(seq_along(values), lengths)))
Or doing the whole summary in data.table:
library(data.table)
setDT(test)[, .(duration = .N, fruit = fruit[1L],
mean_temp = mean(temp)), by = rleid(fruit)]
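Since the question asks for various summaries per run, the dplyr version above can be extended by adding more expressions to summarise. The following is a sketch only: the temp values are copied from the data frame printed in the question rather than regenerated with runif, and the extra column names (min_temp, max_temp, sd_temp) and the result variable are illustrative assumptions.

```r
library(dplyr)

# Data copied from the example data frame shown in the question.
test <- data.frame(
  duration = c(1,2,1,2,3,1,2,3,4,1,2,3,1,1),
  fruit = c('apple','apple','banana','banana','banana','guava','guava',
            'guava','guava','apple','apple','apple','banana','guava'),
  temp = c(86,96,104,94,105,93,103,91,92,90,102,84,92,101)
)

result <- test %>%
  # One group per consecutive run of the same fruit.
  group_by(group = cumsum(fruit != lag(fruit, default = first(fruit)))) %>%
  summarise(fruit = first(fruit),
            duration = n(),          # run length
            mean_temp = mean(temp),
            min_temp = min(temp),
            max_temp = max(temp),
            sd_temp = sd(temp)) %>%  # NA for runs of length 1
  select(-group)

print(result)
```

Runs with a single row have no defined standard deviation, so sd_temp is NA for them.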
Upvotes: 4