Reputation: 331
Other questions have centered around having a start and end date. (see the following for examples Given start date and end date, reshape/expand data for each day between (each day on a row) Expand rows by date range using start and end date
My question is different in that I only have one date column and I would like to convert the unequal date ranges to daily counts. This specific example created deals with number of workers on a job site at one time. Different crews of people come on different dates
A brief data frame provided is as follows:
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
I would like the data to look like this:
dw <- data.frame(date=c(seq(as.Date("1999-03-22"),as.Date("1999-04-10"),by= "day")),
work=c(rep(43,7),rep(95,10),rep(92,3)),
cumwork=c(rep(43,7),rep(138,10),rep(230,3)))
I have been stuck on this for some time. Any help would be appreciated!
UPDATE (7/5/2017): As pointed out by @Scarabee the dates in the dataframe 'dd' should be in date format. Have updated the code to reflect this
Upvotes: 3
Views: 210
Reputation: 7608
Here is a really fast pure base R solution:
ExpandDates <- function(df, lastColRepeat) {
myDiff <- diff(df$date)
dfOut <- data.frame(df$date[1] + 0:(sum(myDiff) + lastColRepeat - 1L),
stringsAsFactors=FALSE)
myDiff <- c(myDiff, lastColRepeat)
for (i in 2:3) {dfOut[,i] <- rep(df[ ,i], times = myDiff)}
names(dfOut) <- names(df)
dfOut
}
The last argument is to determine the number of times the last value should be repeated. As it stands, there is nothing in the original data.frame that would give this value. I'm also assuming that the "date" field is actually a date as pointed out by @Scarabee.
Here is some test data:
set.seed(123)
workVec <- sample(5000, 3000)
testDF <- data.frame(date = as.Date(sort(sample(12000, 3000)),
origin = "1970-01-01"), work = workVec,
cumwork = cumsum(workVec))
DplyrTest <- function(dd) { ## from @Scarabee
v <- data.frame(date = seq(min(dd$date), max(dd$date), by="day"))
v %>%
left_join(dd, by = "date") %>%
na.locf
}
a <- ExpandDates(testDF, 1)
b <- DplyrTest(testDF)
Test for equality:
identical(a$cumwork, as.integer(b$cumwork))
[1] TRUE
identical(a$work, as.integer(b$work))
[1] TRUE
identical(a$date, as.Date(b$date))
[1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(DplyrTest(testDF), ExpandDates(testDF,1))
Unit: milliseconds
expr min lq mean median uq max neval cld
DplyrTest(testDF) 80.909303 84.337006 91.315057 86.320883 88.818739 173.69395 100 b
ExpandDates(testDF, 1) 1.122384 1.208184 2.521693 1.355564 1.486317 72.23444 100 a
Upvotes: 0
Reputation: 8377
A solution similar, with base R (and zoo
package):
dd$date <- as.Date(as.character(dd$date))
my.seq <- data.frame(date=seq.Date(from=range(dd$date)[1], to=range(dd$date)[2], by="day"))
output <- merge(my.seq, dd, all.x=TRUE)
output <- zoo::na.locf(output)
You first have to transform your date into a Date format. Then separately create a vector of complete dates and merge it with the original data. Eventually, run a "last observation carried forward" algorithm.
Upvotes: 0
Reputation: 5704
A possible way:
First, create the sequence of dates you're interested in as a one-column dataframe:
v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))
Next, join with your original dataframe and fill the missing values, for instance using dplyr
and zoo
:
library(dplyr)
library(zoo)
v %>%
left_join(dd, by = "date") %>%
na.locf
NB: I suppose that your dataframe dd
actually contains dates (and not factors).
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
Upvotes: 1