Reputation: 968
I have this sort of data (simplified obviously):
Var1 Var2 Var3
20 0.4 a
50 0.5 a
80 0.6 b
150 0.3 a
250 0.4 b
I want to group them according to Var1 if they fall into an interval of 50, then get the mean of Var1 and Var2, and keep Var3 as is if it's homogeneous, or rename it if the group has mixed labels. In this case I would get:
Var1 Var2 Var3
50 0.5 mixed
150 0.3 a
250 0.4 b
I'm guessing I should use the group_by
function from dplyr
package but I don't know how exactly. Thanks for your help!
Upvotes: 1
Views: 98
Reputation: 39858
Another dplyr
possibility could be:
df %>%
group_by(grp = cumsum(Var1 - lag(Var1, default = first(Var1)) > 50)) %>%
summarise(Var1 = mean(Var1),
Var2 = mean(Var2),
Var3 = ifelse(n_distinct(Var3) > 1, "mixed", Var3)) %>%
ungroup() %>%
select(-grp)
Var1 Var2 Var3
<dbl> <dbl> <chr>
1 50 0.5 mixed
2 150 0.3 a
3 250 0.4 b
Upvotes: 2
Reputation: 1313
here's the dataframe with dput
d <- structure(list(Var1 = c(20L, 50L, 80L, 150L, 250L), Var2 = c(0.4,
0.5, 0.6, 0.3, 0.4), Var3 = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("a",
"b"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
I'd
in the tidyverse this might look like
d %>%
# make sure we sort Var1
arrange(Var1) %>%
# increment var1 by 50 and test that against the next row
# if the next value exceeds current by 50, we mark it as a new group
mutate(nextint=Var1+50,
newgroup=Var1>lag(nextint,default=-Inf),
grp=cumsum(newgroup)) %>%
# for each group, get the mean and a comma separated list of distinct Var3 values
group_by(grp) %>%
summarise(
grplbl=floor(max(Var1)/50)*50,
mu=mean(Var2),
mix=paste(collapse=",",unique(Var3))) %>%
# if mix (distinct Var3) has a comma in it, change from e.g. 'a,b' to 'mix'
mutate(mix=ifelse(grepl(',', mix), 'mixed', mix))
# A tibble: 3 x 4
grp grplbl mu mix
<int> <dbl> <dbl> <chr>
1 1 50 0.5 mixed
2 2 150 0.3 a
3 3 250 0.4 b
Upvotes: 1