user16543787
user16543787

Reputation: 80

R function to convert dates in rows to timespan

I have a dataframe with subjects and years, for each year there is only one subject. I want to create timespanes from the categories when the dates are continuous:

cat   <- c("Cat1","Cat1","Cat2","Cat2","Cat2","Cat3","Cat2","Cat2","Cat2")
year <- c(2010,2011,2012,2013,2014,2015,2016,2017,2018)
df <- data.frame(Cat=cat, Year=year)
#  Which looks like the following:
# Cat1 2010
# Cat1 2011
# Cat2 2012
# Cat2 2013
# Cat2 2014
# Cat3 2015
# Cat2 2016
# Cat2 2017
# Cat2 2018

What I want as the output is a dataframe like this:

cat   <- c("Cat1","Cat2","Cat3","Cat2")
year <- c(2010,2012,2015,2016)
e_year <- c(2011,2014,2015,2018)
df_goal <- data.frame(Cat=cat, Year=year, EYear = e_year)
#  Cat Year EYear
# Cat1 2010  2011
# Cat2 2012  2014
# Cat3 2015  2015
# Cat2 2016  2018

I thought of doing it with a loop but I dont think thats the proper way to do it in R. So I wanted to ask before I spend time on that solution.

Upvotes: 1

Views: 76

Answers (2)

U13-Forward
U13-Forward

Reputation: 71580

With dplyr:

df %>% group_by(Cat, N=cumsum(Cat != lag(Cat, default=""))) %>%
  summarize(SYear=min(Year), EYear=max(Year)) %>%
  arrange(N) %>% select(-N)

Output:

  Cat   SYear EYear
  <chr> <dbl> <dbl>
1 Cat1   2010  2011
2 Cat2   2012  2014
3 Cat3   2015  2015
4 Cat2   2016  2018

Upvotes: 1

r2evans
r2evans

Reputation: 160437

This is strongly related to both Calculate the mean by group (summarize by group) and Get first and last value from groups using rle, though slightly different from each.

base R

out <- aggregate(Year ~ Cat + grp, data = df, FUN = range)
out <- do.call(cbind.data.frame, out[,-2])
names(out)[2:3] <- c("Year", "EYear")
out
#    Cat Year EYear
# 1 Cat1 2010  2011
# 2 Cat2 2012  2014
# 3 Cat3 2015  2015
# 4 Cat2 2016  2018

dplyr

library(dplyr)
df %>%
  group_by(grp = cumsum(Cat != lag(Cat, default = ""))) %>%
  summarize(Cat = Cat[1], EYear = max(Year), Year = min(Year)) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 4 x 3
#   Cat   EYear  Year
#   <chr> <dbl> <dbl>
# 1 Cat1   2011  2010
# 2 Cat2   2014  2012
# 3 Cat3   2015  2015
# 4 Cat2   2018  2016

data.table

library(data.table)
as.data.table(df)[, .(Cat = Cat[1], EYear = max(Year), Year = min(Year)), by = .(grp = rleid(Cat))
  ][, grp := NULL]
#       Cat EYear  Year
#    <char> <num> <num>
# 1:   Cat1  2011  2010
# 2:   Cat2  2014  2012
# 3:   Cat3  2015  2015
# 4:   Cat2  2018  2016

Upvotes: 2

Related Questions