Sebastian Hubard

Reputation: 163

Summarize Dates into Varying Groups

I have a variable that provides miscellaneous dates. I want to summarize these so they can be factored before being used in a predictive model.

I would like to group the dates by the following:

I'm pretty new to R so any help on this would be much appreciated. Thank you

Upvotes: 0

Views: 49

Answers (1)

semaphorism

Reputation: 866

As other commenters have noted, you haven't supplied any data or a reproducible example, but let's give this a go anyway.

I'll be using two tidyverse packages, dplyr and lubridate, to help us out.

For present purposes, let's start by generating some random dates and putting them into a dataframe/tibble. I'm assuming your dates are already within a dataframe and stored in the right class (Date), as Gregor pointed out above.

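library(dplyr); library(lubridate)  # dplyr re-exports tibble(); lubridate supplies year(), today(), years()
data <- tibble(date = sample(seq(as.Date("2015-01-01"), as.Date("2020-12-31"), by = "day"), 50))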
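The sample data above is already of class Date. If your own column holds character strings instead, a conversion along these lines may be needed first. This is only a sketch: the column name date_chr and the year-month-day format are assumptions for illustration, not something taken from the question.

```r
library(dplyr)
library(lubridate)

# Hypothetical raw data: dates stored as character strings
raw <- tibble(date_chr = c("2019-03-15", "2020-11-02", "2016-07-30"))

# ymd() parses year-month-day strings into Date objects
converted <- raw %>%
  mutate(date = ymd(date_chr))

class(converted$date)
```

If your strings use another order, lubridate's dmy() or mdy() would be the ones to try instead.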

Let's now use dplyr and lubridate to recode the dates into a new variable, date_group:

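data %>%
  mutate(date_group = factor(
    case_when(
      # Calendar-year comparisons
      year(date) == year(today()) ~ "This Year",
      year(date) == year(today()) - 1 ~ "Last Year",
      # Rolling cutoff: compare the date itself (not its year) with today minus 3 years
      date < today() - years(3) ~ "Over 3 Years Ago",
      # Default for anything not matched above
      TRUE ~ "Other"
    )
  ))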

For the first two groups, we apply the lubridate function year() (which extracts the year from a date) to the date column in data, and compare the result against the year extracted from today's date (using today()).

For dates over 3 years ago, we subtract 3 years from today's date using years() and test whether the date falls before that cutoff. Note that this is a rolling cutoff, unlike the calendar-year comparisons used for this year and last year.

Of course, this leaves a gap for dates that fall before last calendar year but less than 3 years ago. The final TRUE ~ "Other" branch acts as the default in case_when and labels these as "Other".
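To make the difference between the rolling and calendar-year comparisons concrete, here is a small sketch with a fixed reference date in place of today(), so the result does not depend on when you run it. The reference date and the two test dates are made up for illustration. I've used lubridate's %m-% operator for the subtraction because plain subtraction of years() returns NA when the result would be a non-existent date (for example, starting from 29 February).

```r
library(lubridate)

ref <- as.Date("2021-06-15")   # hypothetical stand-in for today()
cutoff <- ref %m-% years(3)    # rolling cutoff, three years before ref

d <- as.Date(c("2018-03-01", "2018-09-01"))

rolling  <- d < cutoff               # is the date itself before the cutoff?
calendar <- year(d) < year(ref) - 3  # is the date in an earlier calendar year?

rolling
calendar
```

Both dates are in 2018, so the calendar-year test treats them the same, while the rolling test separates the date before 15 June 2018 from the one after it.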

We wrap the result of the case_when function in factor() so that the resulting groups are stored as a factor rather than as character strings, ready for subsequent modelling.
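One thing to be aware of: factor() orders its levels alphabetically by default, which may not be the order you want in model output or plots. A sketch of setting the order explicitly (the ordering chosen here is just my assumption about what would be sensible):

```r
# Example labels as they might come out of case_when()
groups <- c("Last Year", "Other", "This Year", "Over 3 Years Ago")

# Default: levels are sorted alphabetically
f_default <- factor(groups)
levels(f_default)

# Explicit: levels follow the order given in `levels`
wanted <- c("This Year", "Last Year", "Other", "Over 3 Years Ago")
f_ordered <- factor(groups, levels = wanted)
levels(f_ordered)
```

The first level is typically used as the reference category by modelling functions such as lm(), so it can be worth choosing deliberately.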

The case_when function is useful (and easy to read) if you have just a few categories. With many categories the chain of conditions becomes hard to read and maintain, and you should think about another way to structure the grouping.
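As one possible alternative for a larger number of groups, base R's cut() accepts a vector of Date break points and returns a factor directly. This is a sketch only: the dates, break points and labels below are invented for illustration and are not the groups asked for in the question.

```r
# Hypothetical dates to be grouped
dates <- as.Date(c("2014-05-01", "2017-08-20", "2019-01-10",
                   "2020-12-31", "2021-03-03"))

# Break points define intervals that include the left boundary
# and exclude the right one (the default for dates)
breaks <- as.Date(c("2010-01-01", "2016-01-01", "2018-01-01",
                    "2020-01-01", "2022-01-01"))

# One label per interval, so one fewer than the number of breaks
labels <- c("2010-2015", "2016-2017", "2018-2019", "2020-2021")

date_group <- cut(dates, breaks = breaks, labels = labels)
date_group
```

Dates outside the first and last break come back as NA, so the breaks need to span the full range of your data.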

Upvotes: 1
