logjammin
logjammin

Reputation: 1211

In R, use two date columns to calculate the average age of ID's at first event

Background

I've got an R dataframe, d:

d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
                birthdate = as.Date(c("1980-01-01","1980-01-01","2000-12-23","2000-12-23","1949-03-14","1949-03-14","1949-03-14")),
                event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
                stringsAsFactors=FALSE)

It consists of an ID code and two dates: a birthdate and an event_date. Everyone's got a consistent birthdate, but people have multiple events each, all of these occurring on different dates.

The Problem

I'm trying to calculate the average age of people (IDs) in d at their first event. In other words, I'd like to get R to calculate an "age at the first event" by subtracting each ID's first event from their birthdate, and then summing them and dividing by n (3, in this case).

The answer (if my arithmetic isn't too far off this late at night) should be ~35.3 years old.

What I've tried

I'm not too familiar with date work in R, so I've only gotten so far as mutating a new column that calculates the difference between event_date and birthdate for that row:

d <- d %>% 
  mutate(date_difference = (event_date-birthdate)/365)

But I'm still a ways away from my summary calculation. I'm mainly hung up on how to tell R to find the first date difference for each ID. (Not to mention that dividing by 365 gives me correct years but they're still labeled "days" in the resulting df.)

Upvotes: 1

Views: 771

Answers (2)

dipetkov
dipetkov

Reputation: 3700

As an alternative to dividing by 365, you can use the lubridate::time_length function. It computes the length of a period in different time units (seconds, minutes, days, years).

library("tidyverse")

dat <- data.frame(
  ID = c("a", "a", "b", "b", "c", "c", "c"),
  birthdate = as.Date(c("1980-01-01", "1980-01-01", "2000-12-23", "2000-12-23", "1949-03-14", "1949-03-14", "1949-03-14")),
  event_date = as.Date(c("2011-01-01", "2012-08-21", "2011-12-23", "2011-12-31", "2013-03-14", "2013-04-07", "2014-07-14")),
  stringsAsFactors = FALSE
)

dat_with_age <- dat %>%
  group_by(ID) %>%
  slice_min(
    event_date,
    n = 1
  ) %>%
  ungroup() %>%
  mutate(
    age_at_first_event = lubridate::time_length(event_date - birthdate, unit = "year")
  )
dat_with_age
#> # A tibble: 3 × 4
#>   ID    birthdate  event_date age_at_first_event
#>   <chr> <date>     <date>                  <dbl>
#> 1 a     1980-01-01 2011-01-01               31.0
#> 2 b     2000-12-23 2011-12-23               11.0
#> 3 c     1949-03-14 2013-03-14               64

dat_with_age %>%
  summarise(
    mean(age_at_first_event)
  )
#> # A tibble: 1 × 1
#>   `mean(age_at_first_event)`
#>                        <dbl>
#> 1                       35.3

Created on 2022-03-11 by the reprex package (v2.0.1)

Upvotes: 1

Quinten
Quinten

Reputation: 41357

You can use this code:

d <- d %>%
  group_by(ID) %>%
  arrange(event_date) %>%
  slice(1) %>%
  mutate(date_difference = as.numeric((event_date-birthdate)/365)) %>%
  ungroup() %>%
  mutate(average_age = mean(date_difference))

Output:

# A tibble: 3 × 5
  ID    birthdate  event_date date_difference average_age
  <chr> <date>     <date>               <dbl>       <dbl>
1 a     1980-01-01 2011-01-01            31.0        35.4
2 b     2000-12-23 2011-12-23            11.0        35.4
3 c     1949-03-14 2013-03-14            64.0        35.4

Upvotes: 1

Related Questions