Laura
Laura

Reputation: 519

Computationally efficient ways for working with dates in large datasets in lubridate in R

I have data which looks like this, but is 20 million rows.

library(tidyr)
library(dplyr)
library(stringr)
library(magrittr)
library(lubridate)
library(tidyverse)


df <- data.frame(
  DATE_OF_BIRTH = c("1933-03-31", "1947-06-25", "1901-09-02", "1952-01-22", "1936-07-18", "2020-10-22", "1930-05-18", "1926-05-13"),
  DATE_OF_DEATH = c(NA, "2019-02-04", "2017-10-27", NA, "2021-01-03", NA, NA, NA),
) 

What I want to do is

A) figure out how old people are, as at 31 December 2019; and divide them into age groupings

B) remove people with impossible ages or dates of death

This is the code I am running to do this

#Change the missing dates of death into a format recognisable as a date, which is far into the future
df %<>%
  replace_na(list(DATE_OF_DEATH = "01/01/9999"))

#Specify the start and end date of the year of interest
end_yr_date = dmy('31/12/2019')
start_yr_date = dmy('01/01/2019')

df %<>%
  #create age
  mutate(age = floor(interval(start = dmy(DATE_OF_BIRTH), end = end_yr_date) / 
                       duration(num = 1, units = "years"))) %>%
  #and age groupings
  mutate(age_group = cut(age, 
                         breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 150), 
                         labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44", 
                                    "45-49", "50-54","55-59", "60-64", "65-69",
                                    "70-74", "75-79", "80-84", "85+"), right = FALSE))


df %<>%
  #remove people who were born after end date
  filter(!(dmy(DATE_OF_BIRTH) > end_yr_date)) %>%
  #remove people who died before start date
  filter(!(dmy(DATE_OF_DEATH) < start_yr_date)) %>%
  #Remove people with a negative age
  filter(age >= 0) %>%
  #Remove people older than 115
  filter(age < 116)

This runs fine on this sample dataset, but it just keeps running and running and running on the 20 million rows of data. I am wondering if there are ways to work with dates which are more computationally efficient, and maybe faster?

I'm also wondering if I might have date formats which won't parse (I have removed NA dates, but maybe there are other data entry errors which aren't in the right format) which is why the code just keeps running. Does anyone know of an efficient way to determine any date formats which wouldn't parse (which aren't NA)?

Thanks for your help.

Upvotes: 0

Views: 122

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389255

You can change the columns to date class once and include all the filter expression into one.

library(dplyr)
library(lubridate)

df %>%
  mutate(across(c(DATE_OF_BIRTH, DATE_OF_DEATH), ymd),
         age = floor(interval(start = DATE_OF_BIRTH, end = end_yr_date) / 
                       duration(num = 1, units = "years")), 
         age_group = cut(age, 
                         breaks = c(seq(0, 85, 5), 150), 
                         labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44", 
                                    "45-49", "50-54","55-59", "60-64", "65-69",
                                    "70-74", "75-79", "80-84", "85+"), right = FALSE)) %>%
  filter(DATE_OF_BIRTH < end_yr_date, DATE_OF_DEATH > start_yr_date, 
         between(age, 0, 116)) -> result

If this is still slow you can switch to data.table.

library(data.table)

setDT(df)
df[, c('DATE_OF_BIRTH', 'DATE_OF_DEATH') := lapply(.SD, ymd), .SDcols = c('DATE_OF_BIRTH', 'DATE_OF_DEATH')] %>%
  .[, age := floor(interval(start = DATE_OF_BIRTH, end = end_yr_date) / 
                duration(num = 1, units = "years"))] %>%
  .[, age_group := cut(age, 
                breaks = c(seq(0, 85, 5), 150), 
                labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44", 
                           "45-49", "50-54","55-59", "60-64", "65-69",
                           "70-74", "75-79", "80-84", "85+"), right = FALSE)] %>%
  .[DATE_OF_BIRTH < end_yr_date & DATE_OF_DEATH > start_yr_date & between(age, 0, 116)] -> result

Upvotes: 3

Related Questions