Reputation: 519
I have data which looks like this, but is 20 million rows.
library(tidyr)
library(dplyr)
library(stringr)
library(magrittr)
library(lubridate)
library(tidyverse)
df <- data.frame(
DATE_OF_BIRTH = c("1933-03-31", "1947-06-25", "1901-09-02", "1952-01-22", "1936-07-18", "2020-10-22", "1930-05-18", "1926-05-13"),
DATE_OF_DEATH = c(NA, "2019-02-04", "2017-10-27", NA, "2021-01-03", NA, NA, NA),
)
What I want to do is
A) figure out how old people are, as at 31 December 2019; and divide them into age groupings
B) remove people with impossible ages or dates of death
This is the code I am running to do this
#Change the missing dates of death into a format recognisable as a date, which is far into the future
df %<>%
replace_na(list(DATE_OF_DEATH = "01/01/9999"))
#Specify the start and end date of the year of interest
end_yr_date = dmy('31/12/2019')
start_yr_date = dmy('01/01/2019')
df %<>%
#create age
mutate(age = floor(interval(start = dmy(DATE_OF_BIRTH), end = end_yr_date) /
duration(num = 1, units = "years"))) %>%
#and age groupings
mutate(age_group = cut(age,
breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 150),
labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44",
"45-49", "50-54","55-59", "60-64", "65-69",
"70-74", "75-79", "80-84", "85+"), right = FALSE))
df %<>%
#remove people who were born after end date
filter(!(dmy(DATE_OF_BIRTH) > end_yr_date)) %>%
#remove people who died before start date
filter(!(dmy(DATE_OF_DEATH) < start_yr_date)) %>%
#Remove people with a negative age
filter(age >= 0) %>%
#Remove people older than 115
filter(age < 116)
This runs fine on this sample dataset, but it just keeps running and running and running on the 20 million rows of data. I am wondering if there are ways to work with dates which are more computationally efficient, and maybe faster?
I'm also wondering if I might have date formats which won't parse (I have removed NA dates, but maybe there are other data entry errors which aren't in the right format) which is why the code just keeps running. Does anyone know of an efficient way to determine any date formats which wouldn't parse (which aren't NA)?
Thanks for your help.
Upvotes: 0
Views: 122
Reputation: 389255
You can change the columns to date class once and include all the filter
expression into one.
library(dplyr)
library(lubridate)
df %>%
mutate(across(c(DATE_OF_BIRTH, DATE_OF_DEATH), ymd),
age = floor(interval(start = DATE_OF_BIRTH, end = end_yr_date) /
duration(num = 1, units = "years")),
age_group = cut(age,
breaks = c(seq(0, 85, 5), 150),
labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44",
"45-49", "50-54","55-59", "60-64", "65-69",
"70-74", "75-79", "80-84", "85+"), right = FALSE)) %>%
filter(DATE_OF_BIRTH < end_yr_date, DATE_OF_DEATH > start_yr_date,
between(age, 0, 116)) -> result
If this is still slow you can switch to data.table
.
library(data.table)
setDT(df)
df[, c('DATE_OF_BIRTH', 'DATE_OF_DEATH') := lapply(.SD, ymd), .SDcols = c('DATE_OF_BIRTH', 'DATE_OF_DEATH')] %>%
.[, age := floor(interval(start = DATE_OF_BIRTH, end = end_yr_date) /
duration(num = 1, units = "years"))] %>%
.[, age_group := cut(age,
breaks = c(seq(0, 85, 5), 150),
labels = c("00-04",'05-09','10-14',"15-19", "20-24", "25-29", "30-34", "35-39", "40-44",
"45-49", "50-54","55-59", "60-64", "65-69",
"70-74", "75-79", "80-84", "85+"), right = FALSE)] %>%
.[DATE_OF_BIRTH < end_yr_date & DATE_OF_DEATH > start_yr_date & between(age, 0, 116)] -> result
Upvotes: 3