Reputation: 539
How does one easily disaggregate quarterly data to daily data? In this case I'm using 10 years of US GDP data which have quarterly observations, and I want to expand the data frame to daily level, carrying over the GDP value each day until the next observation.
Reprex table:
structure(list(thedate = structure(c(14426, 14518, 14610, 14700,
14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522,
15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344,
16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167,
17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9,
-0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5,
5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5,
2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA,
-40L))
We see above:
2009-07-01 | 1.5
2009-10-01 | 4.5
The intended output would look like:
2009-07-01 | 1.5
2009-07-02 | 1.5
2009-07-03 | 1.5
etc.
2009-10-01 | 4.5
2009-10-02 | 4.5
2009-10-03 | 4.5
Upvotes: 4
Views: 882
Reputation: 11255
Here's a base solution:
last_quarter_end_date <- seq.Date(df$thedate[nrow(df)], by = 'quarter', length.out = 2)[-1]-1
seqs <- diff(c(df$thedate, last_quarter_end_date))
data.frame(thedate = rep(df$thedate, seqs) + sequence(seqs)-1
, gdp = rep(df$gdp, seqs))
Basically, the difference between dates is how many times you need to repeat a GDP column. Also, I can do seq_len()
for each difference to add back to the original date.
Performance This approach is efficient although I'll note that 0.6 ms isn't really much different than 15 ms in the big picture.
Unit: microseconds
expr min lq mean median uq max neval
cole_base 528.1 554.15 690.379 644.9 663.75 3225.7 100
d_b_base 15735.0 15994.40 17395.754 16243.9 18108.30 38761.8 100
Ben_tidyr 2808.7 2936.40 3356.324 3076.6 3149.65 8065.1 100
Complete code for reference:
DF <- structure(list(thedate = structure(c(14426, 14518, 14610, 14700,
14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522,
15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344,
16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167,
17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9,
-0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5,
5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5,
2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA,
-40L))
library(microbenchmark)
library(tidyr)
microbenchmark(cole_base = {
last_quarter_end_date <- seq.Date(DF$thedate[nrow(DF)], by = 'quarter', length.out = 2)[-1]-1
seqs <- diff(c(DF$thedate, last_quarter_end_date))
data.frame(thedate = rep(DF$thedate, seqs) + sequence(seqs)-1
, gdp = rep(DF$gdp, seqs))
}
, d_b_base = {
do.call(rbind, lapply(2:NROW(DF), function(i){
data.frame(date = head(seq.Date(DF$thedate[i-1], DF$thedate[i], "days"), -1),
gdp = DF$gdp[i - 1])
}))
}
, Ben_tidyr = {
DF %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
fill(gdp)
}
)
Upvotes: 1
Reputation: 30474
Here is a tidyr and zoo package answer that uses 'last observation carried forward' after inserting a sequence of dates with NA:
library(tidyverse)
library(zoo)
data %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
do(na.locf(.))
Edit: Thanks to Shree for reminding that tidyr::fill would eliminate need for zoo:
library(tidyverse)
data %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
fill(gdp)
Upvotes: 2