MBB

Reputation: 347

Trying to merge two dataframes with specific conditions and gap in the rows in R

I have two dataframes (df1 and df2) and I am using dplyr to manipulate my data. However, I am having trouble producing the result described below:

df1 contains information about id, price, and date (id is not unique: a given id can have several prices)

df2 records, for a given id, any modifications made to the price and/or date values of df1

I want to know whether the price and/or date has been modified and, if so, I want to take the new value as the price/date

However, both df1 and df2 can be a little tricky, since there can be several modifications for a given id.

More specifically, for a given modification of price (if one exists; otherwise I take the price given in df1), I want to associate it with the last modification of date (if one exists; otherwise I take the date given in df1), as long as that date is <= df1$date + 30

To sum it up, here's an example:

df1 <- data.frame(
       Id = c(1,1,2),
       price = c(1000,2000,1000),
       date = c("2016-01-01","2016-09-01","2016-01-01")
    )
df1
  Id price       date
  1  1000   2016-01-01
  1  2000   2016-09-01
  2  1000   2016-01-01

And df2 is the following:

df2 <- data.frame(
   Id = c(1,1,1,1,1,2,2),
   price = c(1500,NA,2000,NA,3000,NA,NA),
   date = c(NA, "2016-01-03", "2016-01-05", "2016-09-02","2016-09-03","2016-01-03","2016-01-05")
)

df2
  Id price       date
  1  1500       <NA>
  1    NA   2016-01-03
  1  2000   2016-01-05
  1    NA   2016-09-02
  1  3000   2016-09-03
  2    NA   2016-01-03
  2    NA   2016-01-05

And the result I would like to have is something similar to this:

  Id initial_price initial_date is_modification_price is_modification_date true_price true_date
  1  1000          2016-01-01   TRUE                  TRUE                 2000       2016-01-05
  1  2000          2016-09-01   TRUE                  TRUE                 3000       2016-09-03
  2  1000          2016-01-01   FALSE                 TRUE                 1000       2016-01-05

I hope this is clear enough.

Does anyone have an idea of how to implement this, or even a completely different approach?

Upvotes: 3

Views: 91

Answers (1)

Scarabee

Reputation: 5704

First, prepare your dataframes:

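# dplyr provides the pipe and the verbs used below
library(dplyr)

# fix type: convert date to Date so that date arithmetic (date + 30) works
df1 <- mutate(df1, date = as.Date(date))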

# fill NAs in df2
df2 <- df2 %>%
  mutate(date = as.Date(date)) %>%
  group_by(Id) %>%
  tidyr::fill(price, date) %>%
  ungroup

# fill remaining NAs with default values taken from df1
default_values <- df1 %>%
  group_by(Id) %>%
  slice(1) %>%
  rename(price0 = price, date0 = date) %>%
  ungroup

df2 <- df2 %>%
  left_join(default_values, by = "Id") %>%
  mutate(price = if_else(is.na(price), price0, price),
         date = if_else(is.na(date), date0, date)) %>%
  select(Id, price, date)

Then join:

df1 %>%
  left_join(df2, by = "Id") %>%
  filter(date.y <= date.x + 30) %>%
  group_by(Id, price.x, date.x) %>%
  arrange(date.y) %>%
  slice(n()) %>%
  ungroup %>%
  rename(initial_price = price.x, initial_date = date.x,
         true_price = price.y, true_date = date.y) %>%
  mutate(is_modification_price = (initial_price != true_price),
         is_modification_date = (initial_date != true_date))
# # A tibble: 3 x 7
#      Id initial_price initial_date true_price  true_date is_modification_price is_modification_date
#   <dbl>         <dbl>       <date>      <dbl>     <date>                 <lgl>                <lgl>
# 1     1          1000   2016-01-01       2000 2016-01-05                  TRUE                 TRUE
# 2     1          2000   2016-09-01       3000 2016-09-03                  TRUE                 TRUE
# 3     2          1000   2016-01-01       1000 2016-01-05                 FALSE                 TRUE

Note that the left_join followed by filter in the last step first builds every (df1 row, df2 row) pair sharing an Id, so it can take too much memory on large data. If that is the case, use the non-equi join functionality in data.table instead.
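
As a rough illustration of that data.table route, here is a minimal sketch, not a tested drop-in replacement. It assumes the preparation step has already been applied, so the filled values of df2 are hard-coded for the example data. The names dt1, dt2, limit, k and res are made up for this illustration. The join uses by = .EACHI so that each row of dt1 is reduced to a single row as it is matched, rather than building the full joined table first. It also assumes every row of dt1 has at least one match in dt2, which holds for the example data.

```r
library(data.table)

# df1 from the question, with date already converted to Date
dt1 <- data.table(
  Id    = c(1, 1, 2),
  price = c(1000, 2000, 1000),
  date  = as.Date(c("2016-01-01", "2016-09-01", "2016-01-01"))
)

# df2 as it looks after the preparation step (NAs filled in)
dt2 <- data.table(
  Id    = c(1, 1, 1, 1, 1, 2, 2),
  price = c(1500, 1500, 2000, 2000, 3000, 1000, 1000),
  date  = as.Date(c("2016-01-01", "2016-01-03", "2016-01-05",
                    "2016-09-02", "2016-09-03", "2016-01-03", "2016-01-05"))
)

# upper bound for an acceptable modification date (hypothetical helper column)
dt1[, limit := date + 30]

# non-equi join: same Id, and dt2's date no later than dt1's limit;
# by = .EACHI evaluates j once per row of dt1, keeping only the latest match
res <- dt2[dt1,
  on = .(Id, date <= limit),
  {
    k <- which.max(as.numeric(x.date))  # position of the latest matching date
    .(initial_price = i.price,
      initial_date  = i.date,
      true_price    = x.price[k],
      true_date     = x.date[k])
  },
  by = .EACHI]

# flag the modifications, then keep only the columns of interest
res[, `:=`(is_modification_price = initial_price != true_price,
           is_modification_date  = initial_date != true_date)]
res <- res[, .(Id, initial_price, initial_date, true_price, true_date,
               is_modification_price, is_modification_date)]
res
```

The x. and i. prefixes refer to the columns of dt2 and dt1 respectively. They are needed here because, in a non-equi join, the unprefixed join column takes its values from dt1 (the limit) rather than from dt2.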

Upvotes: 1
