majom

Reputation: 8021

Reshaping data based on Date objects using data.table

I use data.table quite heavily for reshaping my data. However, after updating the data.table package, my code no longer works.

I basically want to expand my dataset based on two columns (start.date and stop.date), so that each id gets one row per day in that range.

Please see the toy example below:

# Set up toy data
library(data.table)
id <- letters[1:3]
start.date <- as.Date(c("2012-01-01", "2012-01-03", "2012-01-05"))
stop.date <- as.Date(c("2012-01-03", "2012-01-06", "2012-01-06"))
d <- data.table(id, start.date, stop.date)

# This is what the input data looks like
#    id start.date  stop.date
# 1:  a 2012-01-01 2012-01-03
# 2:  b 2012-01-03 2012-01-06
# 3:  c 2012-01-05 2012-01-06

# Code that worked with an older version of data.table
d.new <- d[, c(.SD, list(time=seq(start.date, stop.date, by="days"))), by=id]

# After the update, the result looks like this:
#    id start.date  stop.date                                          V3
# 1:  a 2012-01-01 2012-01-03            2012-01-01,2012-01-02,2012-01-03
# 2:  b 2012-01-03 2012-01-06 2012-01-03,2012-01-04,2012-01-05,2012-01-06
# 3:  c 2012-01-05 2012-01-06                       2012-01-05,2012-01-06

This is what the final data should look like (and did look like before I updated the data.table package):

#    id start.date  stop.date time
# 1:  a 2012-01-01 2012-01-03 2012-01-01
# 2:  a 2012-01-01 2012-01-03 2012-01-02
# 3:  a 2012-01-01 2012-01-03 2012-01-03
# 4:  b 2012-01-03 2012-01-06 2012-01-03
# 5:  b 2012-01-03 2012-01-06 2012-01-04
# 6:  b 2012-01-03 2012-01-06 2012-01-05
# 7:  b 2012-01-03 2012-01-06 2012-01-06
# 8:  c 2012-01-05 2012-01-06 2012-01-05
# 9:  c 2012-01-05 2012-01-06 2012-01-06

Upvotes: 2

Answers (1)

Arun

Reputation: 118779

Thanks for catching this one and also for filing the bug #861. This is now fixed in v1.9.5. From NEWS:

Some optimisations of .SD in j was done in 1.9.4, refer to #735. Due to an oversight, j-expressions of the form c(lapply(.SD, ...), list(...)) were optimised improperly. This is now fixed. Thanks to @mmeierer for filing #861.

That is:

d.new <- d[, c(.SD, list(time=seq(start.date, stop.date, by="days"))), by=id] 

will work as intended, but faster (as it is internally optimised - now correctly).

My earlier suggestion reflected how I thought it should work, and I had implemented the optimisation that way (which was incorrect). Now all good to go :-).

We plan to push the next release very soon, with a bunch of quick high-priority fixes so that things run smoothly.
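
For anyone who cannot upgrade right away, here is a minimal sketch of a way to get the same expansion without using the c(.SD, list(...)) form in j at all. This is only one possible approach, not what data.table does internally; the helper variable n (the number of days per row) is my own name for illustration, and the toy data is the one from the question.

```r
library(data.table)

# Toy data from the question
d <- data.table(
  id         = letters[1:3],
  start.date = as.Date(c("2012-01-01", "2012-01-03", "2012-01-05")),
  stop.date  = as.Date(c("2012-01-03", "2012-01-06", "2012-01-06"))
)

# Number of days covered by each row, both endpoints included
# (n is a hypothetical helper name, not part of data.table)
n <- d[, as.integer(stop.date - start.date) + 1L]

# Repeat each row once per day it covers
d.new <- d[rep(seq_len(nrow(d)), n)]

# sequence(n) counts 1..n[i] within each original row,
# so this adds 0, 1, 2, ... days to the start date
d.new[, time := start.date + sequence(n) - 1L]

print(d.new)
```

Since this never touches .SD in j, it should not depend on how that optimisation behaves in a given version.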

Upvotes: 1
