rioualen
rioualen

Reputation: 968

How to group data into not pre-determined intervals using R and dplyr?

I have this sort of data (simplified obviously):

Var1 Var2 Var3
20   0.4  a
50   0.5  a
80   0.6  b
150  0.3  a
250  0.4  b

I want to group them according to Var1 if they fall into an interval of 50, then get the mean of Var1 and Var2, and keep Var3 as is if it's homogeneous, or rename it if the group has mixed labels. In this case I would get:

Var1 Var2 Var3
50   0.5  mixed
150  0.3  a
250  0.4  b

I'm guessing I should use the group_by function from dplyr package but I don't know how exactly. Thanks for your help!

Upvotes: 1

Views: 98

Answers (2)

tmfmnk
tmfmnk

Reputation: 39858

Another dplyr possibility could be:

df %>%
 group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > 50)) %>%
 summarise(Var1 = mean(Var1),
           Var2 = mean(Var2),
           Var3 = ifelse(n_distinct(Var3) > 1, "mixed", Var3)) %>%
 ungroup() %>%
 select(-grp)

   Var1  Var2 Var3 
  <dbl> <dbl> <chr>
1    50   0.5 mixed
2   150   0.3 a    
3   250   0.4 b  

Upvotes: 2

Will
Will

Reputation: 1313

here's the dataframe with dput

d <- structure(list(Var1 = c(20L, 50L, 80L, 150L, 250L), Var2 = c(0.4, 
0.5, 0.6, 0.3, 0.4), Var3 = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("a", 
"b"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

I'd

  1. create a few temporary columns to determine when a new group starts
  2. group and calculate mean, but also track distinct values of Var3
  3. change to mix if more than one Var3 value in a group

in the tidyverse this might look like

d %>% 
 # make sure we sort Var1
 arrange(Var1) %>% 
 # increment var1 by 50 and test that against the next row
 # if the next value exceeds current by 50, we mark it as a new group
 mutate(nextint=Var1+50, 
       newgroup=Var1>lag(nextint,default=-Inf), 
       grp=cumsum(newgroup)) %>%
 # for each group, get the mean and a comma separated list of distinct Var3 values
 group_by(grp) %>% 
 summarise(
           grplbl=floor(max(Var1)/50)*50,
           mu=mean(Var2), 
           mix=paste(collapse=",",unique(Var3))) %>%
 # if mix (distinct Var3) has a comma in it, change from e.g. 'a,b' to 'mix'
 mutate(mix=ifelse(grepl(',', mix), 'mixed', mix))
# A tibble: 3 x 4
    grp grplbl    mu mix  
  <int>  <dbl> <dbl> <chr>
1     1     50   0.5 mixed
2     2    150   0.3 a    
3     3    250   0.4 b  

Upvotes: 1

Related Questions