Marina
Marina

Reputation: 143

Grouping rows with same values

I am working with extreme dry events and trying to obtain some information about their characteristics. This is an example of my data:

   dat <- data.frame(length= c(39,1,1,1,98,1,1,1,57,1,1,1,34,1,1), value = c(0,-1.111,-1.645,-1.285,0,-1.223,-1.369,-1.007,0,-1.083,-1.675,-1.119,0,-1.554,-1.6228))

Rows are months, thus column 'length' identifies a dry month with the number 1, for its part, column 'value' records the severity of that dry event. What I would like to obtain is, on one hand the median and maximum length of dry events but considering each group of consecutive dry months (length = 1) as an event (red boxes); on the other hand I would like to calculate the median and minimum values of the severity of all the dry events in the serie.

This screenshot shows what I am trying to get and the values I expect to obtain.

enter image description here

My main question is how I can considerate the groups of consecutive rows with value 1 in column 'length' as an unique case and calculate these simple statistics.

Thank you so much in advance for any help provided.

Upvotes: 3

Views: 115

Answers (2)

DeduciveR
DeduciveR

Reputation: 1702

A fully dplyr solution:

library(tidyverse)
dat2 <- dat %>%
  mutate(zero_count = if_else(value == 0, 1, 0), 
         group_id = cumsum(zero_count)) %>% 
  filter(length == 1) %>% 
  group_by(group_id) %>%
  summarise(mean = mean(value), median = median(value), months = length(group_id))

Firstly I added a new column to flag if there's a zero-value. Then a new column which we cumulatively sum so we can identify each group sequentially / individually. Then group by this new column and filter out the other rows that don't belong. Finally summarise to get the stats.

  group_id  mean median months
     <dbl> <dbl>  <dbl>  <int>
1        1 -1.35  -1.28      3
2        2 -1.20  -1.22      3
3        3 -1.29  -1.12      3
4        4 -1.59  -1.59      2

Upvotes: 1

akrun
akrun

Reputation: 887881

One option would be to create a grouping variable with run-length-id (rleid) and then use that to summarise the 'value' for median, min and other statistics of interest (i.e. number of rows - n())

library(dplyr)
library(data.table)
dat %>% 
    group_by(grp = rleid(length == 1)) %>% 
    filter(length == 1) %>% 
    summarise(Length = n(), Median = median(value), Min = min(value))

Or similar way with data.table by first creating a grouping variable with rleid, grouped by the 'grp' and specifying the i with the logical expression to subset the rows that are only equal to 1 in 'length',get the median and min (or max) in 'value' column

library(data.table)
setDT(dat)[, grp := rleid(length==1)][length == 1, 
   .(Length = .N, Median = median(value), Min = min(value)), .(grp)]

Upvotes: 1

Related Questions