Reputation: 185
I'm attempting to do a merge - i.e. link two datasets based on a common string. The variable I'm trying to link, however, changes overtime, so the merge needs to account for the date in order to link the correct value. Rather than have a matrix for the value to link at each date, I have one which gives the date of each time the value changes.
For example let's say I'd want to merge the price of apples and oranges onto a list of apples and oranges purchased on certain dates. My first dataframe (transactions) contains a the date a purchase took place, and whether it was an apple or an orange that was purchased. The second data frame contains the dates where the price of apples and oranges changed, and what it changed to (in this example prices change on the 1 January, but really it could be any date.
> transactions <- data.frame(Date_Purchased = as.Date(c("02/01/2018", "02/01/2020", "02/01/2019", "02/01/2020"), format = "%d/%m/%Y"), Item_Purchased = c("APPLE", "APPLE", "ORANGE", "ORANGE"))
> transactions
Date_Purchased Item_Purchased
1 2018-01-02 APPLE
2 2020-01-02 APPLE
3 2019-01-02 ORANGE
4 2020-01-02 ORANGE
>price <- data.frame(Date=as.Date(c("01/01/2018", "01/01/2019", "01/01/2020", "01/01/2018", "01/01/2019", "01/01/2020"), format = "%d/%m/%Y"), Item = c("APPLE", "APPLE", "APPLE", "ORANGE", "ORANGE", "ORANGE"), Price = c(0.30, 0.35, 0.40, 0.60, 0.70, 0.75))
> price
Date Item Price
1 2018-01-01 APPLE 0.30
2 2019-01-01 APPLE 0.35
3 2020-01-01 APPLE 0.40
4 2018-01-01 ORANGE 0.60
5 2019-01-01 ORANGE 0.70
6 2020-01-01 ORANGE 0.75
The cost of an apple on January 2nd 2018 is 30c, and its cost on January 2nd 2020 is 40c. Similarly the cost of an orange on January 2nd 2019 is 70c and January 2nd 2020 75c.
As such I need the merged dataset to look like:
Date_Purchased Item_Purchased Price_On_Date_Purchased
1 2018-01-02 APPLE 0.30
2 2020-01-02 APPLE 0.40
3 2019-01-02 ORANGE 0.70
4 2020-01-02 ORANGE 0.75
Unfortunately I'm really limited on the machine that I'm on in that I don't have access to the CRAN library and can't download additional packages, which means I haven't been able to use the neardate() function in what I've tried, which I think would be useful.
This is a level above what I'm used to doing on R so I'm at a bit of a loss to be honest.
Upvotes: 1
Views: 76
Reputation: 73352
Using apply
.
transform(d1,
Price_On_Date_Purchased=
apply(d1, 1, function(x)
tail(d2[d2$Item == x["Item_Purchased"] &
d2$Date <= x["Date_Purchased"], "Price"], 1)))
# Date_Purchased Item_Purchased Price_On_Date_Purchased
# 1 2018-01-02 APPLE 0.30
# 2 2020-01-02 APPLE 0.40
# 3 2019-01-02 ORANGE 0.70
# 4 2020-01-02 ORANGE 0.75
Upvotes: 1
Reputation: 389165
Since you cannot download additional packages here is a base R approach :
transactions$Price_On_Date_Purchased <- unlist(
by(transactions, transactions$Item_Purchased, function(x) {
tmp <- subset(price, Item == x$Item_Purchased)
tmp$Price[findInterval(x$Date, tmp$Date)]
}))
transactions
# Date_Purchased Item_Purchased Price_On_Date_Purchased
#1 2018-01-02 APPLE 0.30
#2 2020-01-02 APPLE 0.40
#3 2019-01-02 ORANGE 0.70
#4 2020-01-02 ORANGE 0.75
We divide transactions
based on Item_Purchased
, subset
the corresponding items from price
dataframe. Using findInterval
we find the appropriate date in which the price was changed and get the corresponding Price
value.
Upvotes: 1
Reputation: 319
This should do the trick. I'm sure it can be vectorized, but I'm not too good with applies
transactions$Price <- 0
items <- unique(transactions$Item_Purchased)
for(item_i in seq_along(items)){
date_idx <- findInterval(transactions$Date_Purchased[transactions$Item_Purchased==items[item_i]],
price$Date[price$Item==items[item_i]])
transactions$Price[transactions$Item_Purchased==items[item_i]] <- price$Price[date_idx]
}
Upvotes: 1