Reputation: 80
I have a dataframe with subjects and years, for each year there is only one subject. I want to create timespanes from the categories when the dates are continuous:
cat <- c("Cat1","Cat1","Cat2","Cat2","Cat2","Cat3","Cat2","Cat2","Cat2")
year <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
df <- data.frame(Cat=cat, Year=year)
# Which looks like the following:
# Cat1 2010
# Cat1 2011
# Cat2 2012
# Cat2 2013
# Cat2 2014
# Cat3 2015
# Cat2 2016
# Cat2 2017
# Cat2 2018
What I want as the output is a dataframe like this:
cat <- c("Cat1","Cat2","Cat3","Cat2")
year <- c(2010,2012,2015,2016)
e_year <- c(2011,2014,2015,2018)
df_goal <- data.frame(Cat=cat, Year=year, EYear = e_year)
# Cat Year EYear
# Cat1 2010 2011
# Cat2 2012 2014
# Cat3 2015 2015
# Cat2 2016 2018
I thought of doing it with a loop but I dont think thats the proper way to do it in R. So I wanted to ask before I spend time on that solution.
Upvotes: 1
Views: 76
Reputation: 71580
dplyr
:df %>% group_by(Cat, N=cumsum(Cat != lag(Cat, default=""))) %>%
summarize(SYear=min(Year), EYear=max(Year)) %>%
arrange(N) %>% select(-N)
Output:
Cat SYear EYear
<chr> <dbl> <dbl>
1 Cat1 2010 2011
2 Cat2 2012 2014
3 Cat3 2015 2015
4 Cat2 2016 2018
Upvotes: 1
Reputation: 160437
This is strongly related to both Calculate the mean by group (summarize by group) and Get first and last value from groups using rle, though slightly different from each.
out <- aggregate(Year ~ Cat + grp, data = df, FUN = range)
out <- do.call(cbind.data.frame, out[,-2])
names(out)[2:3] <- c("Year", "EYear")
out
# Cat Year EYear
# 1 Cat1 2010 2011
# 2 Cat2 2012 2014
# 3 Cat3 2015 2015
# 4 Cat2 2016 2018
library(dplyr)
df %>%
group_by(grp = cumsum(Cat != lag(Cat, default = ""))) %>%
summarize(Cat = Cat[1], EYear = max(Year), Year = min(Year)) %>%
ungroup() %>%
select(-grp)
# # A tibble: 4 x 3
# Cat EYear Year
# <chr> <dbl> <dbl>
# 1 Cat1 2011 2010
# 2 Cat2 2014 2012
# 3 Cat3 2015 2015
# 4 Cat2 2018 2016
library(data.table)
as.data.table(df)[, .(Cat = Cat[1], EYear = max(Year), Year = min(Year)), by = .(grp = rleid(Cat))
][, grp := NULL]
# Cat EYear Year
# <char> <num> <num>
# 1: Cat1 2011 2010
# 2: Cat2 2014 2012
# 3: Cat3 2015 2015
# 4: Cat2 2018 2016
Upvotes: 2