Bigchao
Bigchao

Reputation: 1756

Add dummies with conditions in data.table?

Sorry for the long question. I will try my best to clarify my goal clealy

I want to add dummies in data.table using the update method, just like this already answered in this link, but a little bit more complicated.

For better description, I created the data.

DT <- data.table(UID = paste0("UID",rep(1:5,each=2)), 
                 date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
                 value = c(1:10)) 

The DT is a data.table containing the information of UID, date, and value. In the original data, the structure is just the same, but with long time span(2 years).

Here I want to add dummies based on the date.

there're several special time spans in date, we can just use vacations to represent them.

For example, in the fake data I created above.

There're two vacations

  1. From "2012-01-02" to "2012-01-05"
  2. From "2012-02-02" to "2012-02-03"

I want to add 2 types of dummies

  1. Dummies about the length of vacation: First calculate the length from different vacations. In this example, we have two different lengths (2, and 4 ). So we'll add 2 dummies indicating whether the date are in these vacations.

The expected result is like this:

UID     Date    Val D_length_2  D_length_4
UID1    1/1/2012    1   FALSE   FALSE
UID2    1/2/2012    2   FALSE   TRUE
UID3    1/3/2012    3   FALSE   TRUE
UID4    1/4/2012    4   FALSE   TRUE
UID5    1/5/2012    5   FALSE   TRUE
UID1    1/6/2012    6   FALSE   FALSE
UID2    2/1/2012    7   TRUE    FALSE
UID3    2/2/2012    8   TRUE    FALSE
UID4    2/3/2012    9   FALSE   FALSE
UID5    2/4/2012    10  FALSE   FALSE
  1. Dummies about whether the day is exactly one day before the vacation, or exactly one day after the vacation.
UID    Date      Val    Before  After
UID1    1/1/2012    1   TRUE    FALSE
UID2    1/2/2012    2   FALSE   FALSE
UID3    1/3/2012    3   FALSE   FALSE
UID4    1/4/2012    4   FALSE   FALSE
UID5    1/5/2012    5   FALSE   FALSE
UID1    1/6/2012    6   FALSE   TRUE
UID2    2/1/2012    7   TRUE    FALSE
UID3    2/2/2012    8   FALSE   FALSE
UID4    2/3/2012    9   FALSE   FALSE
UID5    2/4/2012    10  FALSE   TRUE

So the total of desired results is like this

UID Date    Val Before  After   D_length_2  D_length_4
UID1    1/1/2012    1   TRUE    FALSE   FALSE   FALSE
UID2    1/2/2012    2   FALSE   FALSE   FALSE   TRUE
UID3    1/3/2012    3   FALSE   FALSE   FALSE   TRUE
UID4    1/4/2012    4   FALSE   FALSE   FALSE   TRUE
UID5    1/5/2012    5   FALSE   FALSE   FALSE   TRUE
UID1    1/6/2012    6   FALSE   TRUE    FALSE   FALSE
UID2    2/1/2012    7   TRUE    FALSE   FALSE   FALSE
UID3    2/2/2012    8   FALSE   FALSE   TRUE    FALSE
UID4    2/3/2012    9   FALSE   FALSE   TRUE    FALSE
UID5    2/4/2012    10  FALSE   TRUE    FALSE   FALSE

The total observations are more than 10M rows, with about 10 different vacations and 4 different length.

For the second type of dummies, I think

f <- function(x){ 
ifelse(x %in% as.Date(c("2012-01-02","2012-02-02")) - 1, return(TRUE), return(FALSE))
}

DT[,Before:= f(date)] 

But it seems not correct.

For the first one, I didn't come up with a good solution.

this problem is about the update in data.table, any thoughts about how to deal with it and how to write the update functions are extremely welcome!

Upvotes: 5

Views: 1056

Answers (1)

Roland
Roland

Reputation: 132864

Here is a start:

library(data.table)

DT <- data.table(UID = paste0("UID",rep(1:5,each=2)), 
                 date = as.IDate(c("2012-01-01","2012-01-02","2012-01-03","2012-01-04","2012-01-05","2012-01-06","2012-02-01","2012-02-02","2012-02-03","2012-02-04")),
                 value = c(1:10)) 
setkey(DT, date)


vacStart <- data.table(start = as.IDate(c("2012-01-02", "2012-02-02")), key="start")
vacEnd <- data.table(date = as.IDate(c("2012-01-05", "2012-02-03")), key="date")

#identify vacations:
vacStart[, Start:=.I]
DT <- vacStart[DT, roll=TRUE]
vacEnd[, End:=.I]
DT <- vacEnd[DT, roll=-Inf]
DT[,vac:=(End==Start)*Start]
DT[is.na(vac), vac:=0L]

#2-day vacations:
DT[,length_2 := (.N==2) & vac!=0, by=vac]
#days before vacation
DT[,before := c(diff(vac)>0, FALSE) & vac==0]
#           date End Start  UID value vac length_2 before
#  1: 2012-01-01   1    NA UID1     1   0    FALSE   TRUE
#  2: 2012-01-02   1     1 UID1     2   1    FALSE  FALSE
#  3: 2012-01-03   1     1 UID2     3   1    FALSE  FALSE
#  4: 2012-01-04   1     1 UID2     4   1    FALSE  FALSE
#  5: 2012-01-05   1     1 UID3     5   1    FALSE  FALSE
#  6: 2012-01-06   2     1 UID3     6   0    FALSE  FALSE
#  7: 2012-02-01   2     1 UID4     7   0    FALSE   TRUE
#  8: 2012-02-02   2     2 UID4     8   2     TRUE  FALSE
#  9: 2012-02-03   2     2 UID5     9   2     TRUE  FALSE
# 10: 2012-02-04  NA     2 UID5    10   0    FALSE  FALSE

Upvotes: 6

Related Questions