Reputation: 25
I would like to create dummy variables that indicate which product version was in effect based on the date range, i.e. between the release date and the following version release date without doing it manually. I have a few hundred versions with their corresponding release dates and when the next version was released, and I will merge/join this data with a Purchases data frame.
Currently I have two data frames (Version and Purchases) that look like this:
View(Version)
Type Version Release_Date Next_Release
A 1.2.3 2013-11-14 2014-01-11
B 1.3.1 2014-01-11 2014-02-20
A 1.5.1 2014-02-20 2014-03-08
A 1.5.2 2014-03-08 2014-04-06
B 1.5.3 2014-04-06 2014-04-12
A 1.5.4 2014-04-12 2014-04-15
B 1.5.5 2014-05-15 2014-05-20
B 1.6.1 2014-05-20 2014-06-26
A 1.6.2 2014-06-26 2014-07-14
View(Purchases)
TIMESTAMP Amount
2013-11-14 15.44
2013-11-14 13.39
2013-11-14 15.35
2013-11-15 86.43
2014-01-15 12.30
2014-01-17 23.55
I would like to create dummy variables for each version in the data frame 'Purchases' according to the date range which the version was in effect. i.e. if the timestamp of the purchase is within the date range that that version was in effect, that version dummy = 1 otherwise 0.
View(Purchases)
TIMESTAMP Amount Version_1.2.3 Version_1.3.1 ....
2013-11-14 15.44 1 0
2013-11-14 13.39 1 0
2013-11-14 15.35 1 0
2013-11-15 86.43 1 0
2014-01-15 12.30 0 1
2014-01-17 23.55 0 1
....
Thanks you in advance for any advise or help.
Upvotes: 0
Views: 700
Reputation: 42544
The task can be divided in two sub-tasks:
Release_Date
before the TIMESTAMP
of the purchase,For finding the most recent release, the data.table
package offers two possibilities,
a rolling join
library(data.table)
setDT(Version)
setDT(Purchases)
Version[Purchases, on = .(Release_Date = TIMESTAMP), roll = TRUE,
.(TIMESTAMP = i.TIMESTAMP, Amount, Version)]
TIMESTAMP Amount Version 1: 2013-11-14 15.44 1.2.3 2: 2013-11-14 13.39 1.2.3 3: 2013-11-14 15.35 1.2.3 4: 2013-11-15 86.43 1.2.3 5: 2014-01-15 12.30 1.3.1 6: 2014-01-17 23.55 1.3.1
or a non-equi join
Version[Purchases, on = .(Release_Date <= TIMESTAMP), mult = "last",
.(TIMESTAMP = i.TIMESTAMP, Amount, Version)]
which produces the same result.
For reshaping, the dcast()
function is used with length()
for aggregation:
# rolling join
Version[Purchases, on = .(Release_Date = TIMESTAMP), roll = TRUE,
.(TIMESTAMP = i.TIMESTAMP, Amount, Version)][
, dcast(.SD, TIMESTAMP + Amount ~ Version, length)]
TIMESTAMP Amount 1.2.3 1.3.1 1: 2013-11-14 13.39 1 0 2: 2013-11-14 15.35 1 0 3: 2013-11-14 15.44 1 0 4: 2013-11-15 86.43 1 0 5: 2014-01-15 12.30 0 1 6: 2014-01-17 23.55 0 1
or, in case the columns shall be renamed during reshaping
# non-equi join
Version[Purchases, on = .(Release_Date <= TIMESTAMP), mult = "last",
.(TIMESTAMP = i.TIMESTAMP, Amount, Version)][
, dcast(.SD, TIMESTAMP + Amount ~ paste0("Version_", Version), length)]
TIMESTAMP Amount Version_1.2.3 Version_1.3.1 1: 2013-11-14 13.39 1 0 2: 2013-11-14 15.35 1 0 3: 2013-11-14 15.44 1 0 4: 2013-11-15 86.43 1 0 5: 2014-01-15 12.30 0 1 6: 2014-01-17 23.55 0 1
Upvotes: 1