Sean
Sean

Reputation: 185

How to merge variable which is changing over time [R]

I'm attempting to do a merge - i.e. link two datasets based on a common string. The variable I'm trying to link, however, changes overtime, so the merge needs to account for the date in order to link the correct value. Rather than have a matrix for the value to link at each date, I have one which gives the date of each time the value changes.

For example let's say I'd want to merge the price of apples and oranges onto a list of apples and oranges purchased on certain dates. My first dataframe (transactions) contains a the date a purchase took place, and whether it was an apple or an orange that was purchased. The second data frame contains the dates where the price of apples and oranges changed, and what it changed to (in this example prices change on the 1 January, but really it could be any date.


> transactions <- data.frame(Date_Purchased = as.Date(c("02/01/2018", "02/01/2020", "02/01/2019", "02/01/2020"), format = "%d/%m/%Y"), Item_Purchased = c("APPLE", "APPLE", "ORANGE", "ORANGE"))

> transactions
  Date_Purchased Item_Purchased
1     2018-01-02          APPLE
2     2020-01-02          APPLE
3     2019-01-02         ORANGE
4     2020-01-02         ORANGE


>price <- data.frame(Date=as.Date(c("01/01/2018", "01/01/2019", "01/01/2020", "01/01/2018", "01/01/2019", "01/01/2020"), format = "%d/%m/%Y"), Item = c("APPLE", "APPLE", "APPLE", "ORANGE", "ORANGE", "ORANGE"), Price = c(0.30, 0.35, 0.40, 0.60, 0.70, 0.75))

> price
        Date   Item Price
1 2018-01-01  APPLE  0.30
2 2019-01-01  APPLE  0.35
3 2020-01-01  APPLE  0.40
4 2018-01-01 ORANGE  0.60
5 2019-01-01 ORANGE  0.70
6 2020-01-01 ORANGE  0.75

The cost of an apple on January 2nd 2018 is 30c, and its cost on January 2nd 2020 is 40c. Similarly the cost of an orange on January 2nd 2019 is 70c and January 2nd 2020 75c.

As such I need the merged dataset to look like:


  Date_Purchased Item_Purchased Price_On_Date_Purchased
1     2018-01-02          APPLE                    0.30
2     2020-01-02          APPLE                    0.40
3     2019-01-02         ORANGE                    0.70
4     2020-01-02         ORANGE                    0.75

Unfortunately I'm really limited on the machine that I'm on in that I don't have access to the CRAN library and can't download additional packages, which means I haven't been able to use the neardate() function in what I've tried, which I think would be useful.

This is a level above what I'm used to doing on R so I'm at a bit of a loss to be honest.

Upvotes: 1

Views: 76

Answers (3)

jay.sf
jay.sf

Reputation: 73352

Using apply.

transform(d1, 
          Price_On_Date_Purchased=
            apply(d1, 1, function(x) 
              tail(d2[d2$Item == x["Item_Purchased"] & 
                        d2$Date <= x["Date_Purchased"], "Price"], 1)))
#   Date_Purchased Item_Purchased Price_On_Date_Purchased
# 1     2018-01-02          APPLE                    0.30
# 2     2020-01-02          APPLE                    0.40
# 3     2019-01-02         ORANGE                    0.70
# 4     2020-01-02         ORANGE                    0.75

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389165

Since you cannot download additional packages here is a base R approach :

transactions$Price_On_Date_Purchased <- unlist(
     by(transactions, transactions$Item_Purchased, function(x) {
         tmp <- subset(price, Item == x$Item_Purchased)
         tmp$Price[findInterval(x$Date, tmp$Date)]
}))

transactions
#  Date_Purchased Item_Purchased Price_On_Date_Purchased
#1     2018-01-02          APPLE                    0.30
#2     2020-01-02          APPLE                    0.40
#3     2019-01-02         ORANGE                    0.70
#4     2020-01-02         ORANGE                    0.75

We divide transactions based on Item_Purchased, subset the corresponding items from price dataframe. Using findInterval we find the appropriate date in which the price was changed and get the corresponding Price value.

Upvotes: 1

Jarn Sch&#246;ber
Jarn Sch&#246;ber

Reputation: 319

This should do the trick. I'm sure it can be vectorized, but I'm not too good with applies

transactions$Price <- 0

items <- unique(transactions$Item_Purchased)

for(item_i in seq_along(items)){
   date_idx <- findInterval(transactions$Date_Purchased[transactions$Item_Purchased==items[item_i]],
                            price$Date[price$Item==items[item_i]])
   transactions$Price[transactions$Item_Purchased==items[item_i]] <- price$Price[date_idx]
}

Upvotes: 1

Related Questions