EDennnis
EDennnis

Reputation: 331

Given only vector of dates, expand data in between (unequal) date points

Other questions have centered around having a start and end date. (see the following for examples Given start date and end date, reshape/expand data for each day between (each day on a row) Expand rows by date range using start and end date

My question is different in that I only have one date column and I would like to convert the unequal date ranges to daily counts. This specific example created deals with number of workers on a job site at one time. Different crews of people come on different dates

A brief data frame provided is as follows:

dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))

I would like the data to look like this:

dw <- data.frame(date=c(seq(as.Date("1999-03-22"),as.Date("1999-04-10"),by= "day")),
       work=c(rep(43,7),rep(95,10),rep(92,3)),
       cumwork=c(rep(43,7),rep(138,10),rep(230,3)))

I have been stuck on this for some time. Any help would be appreciated!

UPDATE (7/5/2017): As pointed out by @Scarabee the dates in the dataframe 'dd' should be in date format. Have updated the code to reflect this

Upvotes: 3

Views: 210

Answers (3)

Joseph Wood
Joseph Wood

Reputation: 7608

Here is a really fast pure base R solution:

ExpandDates <- function(df, lastColRepeat) {
    myDiff <- diff(df$date)
    dfOut <- data.frame(df$date[1] + 0:(sum(myDiff) + lastColRepeat - 1L),
                     stringsAsFactors=FALSE)
    myDiff <- c(myDiff, lastColRepeat)
    for (i in 2:3) {dfOut[,i] <- rep(df[ ,i], times = myDiff)}
    names(dfOut) <- names(df)
    dfOut
}

The last argument is to determine the number of times the last value should be repeated. As it stands, there is nothing in the original data.frame that would give this value. I'm also assuming that the "date" field is actually a date as pointed out by @Scarabee.

Here is some test data:

set.seed(123)
workVec <- sample(5000, 3000)
testDF <- data.frame(date = as.Date(sort(sample(12000, 3000)), 
                                    origin = "1970-01-01"), work = workVec, 
                                                    cumwork = cumsum(workVec))

DplyrTest <- function(dd) {  ## from @Scarabee
    v <- data.frame(date = seq(min(dd$date), max(dd$date), by="day"))
    v %>% 
        left_join(dd, by = "date") %>% 
        na.locf
}

a <- ExpandDates(testDF, 1)
b <- DplyrTest(testDF)

Test for equality:

identical(a$cumwork, as.integer(b$cumwork))
[1] TRUE
identical(a$work, as.integer(b$work))
[1] TRUE
identical(a$date, as.Date(b$date))
[1] TRUE

Benchmarks:

library(microbenchmark)
microbenchmark(DplyrTest(testDF), ExpandDates(testDF,1))
Unit: milliseconds
                  expr       min        lq      mean    median        uq       max neval cld
     DplyrTest(testDF) 80.909303 84.337006 91.315057 86.320883 88.818739 173.69395   100   b
ExpandDates(testDF, 1)  1.122384  1.208184  2.521693  1.355564  1.486317  72.23444   100  a 

Upvotes: 0

agenis
agenis

Reputation: 8377

A solution similar, with base R (and zoo package):

dd$date <- as.Date(as.character(dd$date))
my.seq <- data.frame(date=seq.Date(from=range(dd$date)[1], to=range(dd$date)[2], by="day"))
output <- merge(my.seq, dd, all.x=TRUE)
output <- zoo::na.locf(output)

You first have to transform your date into a Date format. Then separately create a vector of complete dates and merge it with the original data. Eventually, run a "last observation carried forward" algorithm.

Upvotes: 0

Scarabee
Scarabee

Reputation: 5704

A possible way:

First, create the sequence of dates you're interested in as a one-column dataframe:

v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))

Next, join with your original dataframe and fill the missing values, for instance using dplyr and zoo:

library(dplyr)
library(zoo)

v %>% 
  left_join(dd, by = "date") %>% 
  na.locf

NB: I suppose that your dataframe dd actually contains dates (and not factors).

dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))

Upvotes: 1

Related Questions