Reputation: 347
I have two dataframes (df1 and df2). I am working with dplyr to manipulate my data. However, I am having trouble producing the following result:
df1 contains some information about id, price, and date (id is not unique: a given id can have several prices)
df2 tells whether, for a given id, the value of price and/or date in df1 has been modified
I want to know whether price and/or date has been modified and, if so, take the new value as the price/date.
However, both df1 and df2 can be a little tricky, since there can be several modifications for a given id.
More specifically, for a given modification of price (if it exists; otherwise I take the price given in df1), I want to associate it with the last modification of date (if it exists; otherwise I take the date given in df1), as long as that date is <= df1$date + 30.
To sum it up, here's an example:
df1 <- data.frame(
Id = c(1,1,2),
price = c(1000,2000,1000),
date = c("2016-01-01","2016-09-01","2016-01-01")
)
df1
Id price date
1 1000 2016-01-01
1 2000 2016-09-01
2 1000 2016-01-01
And df2 is the following:
df2 <- data.frame(
Id = c(1,1,1,1,1,2,2),
price = c(1500,NA,2000,NA,3000,NA,NA),
date = c(NA, "2016-01-03", "2016-01-05", "2016-09-02","2016-09-03","2016-01-03","2016-01-05")
)
df2
Id price date
1 1500 <NA>
1 NA 2016-01-03
1 2000 2016-01-05
1 NA 2016-09-02
1 3000 2016-09-03
2 NA 2016-01-03
2 NA 2016-01-05
And the result I would like is something similar to this:
Id initial_price initial_date is_modification_price is_modification_date true_price true_date
1 1000 2016-01-01 TRUE TRUE 2000 2016-01-05
1 2000 2016-09-01 TRUE TRUE 3000 2016-09-03
2 1000 2016-01-01 FALSE TRUE 1000 2016-01-05
I hope this is clear enough.
Does anyone have an idea of how to implement this, or even a completely different approach?
Upvotes: 3
Views: 91
Reputation: 5704
First, prepare your dataframes:
library(dplyr)

# fix type
df1 <- mutate(df1, date = as.Date(date))

# fill NAs in df2 (carry the last known value forward within each Id)
df2 <- df2 %>%
  mutate(date = as.Date(date)) %>%
  group_by(Id) %>%
  tidyr::fill(price, date) %>%
  ungroup()

# fill remaining NAs with default values taken from the first df1 row of each Id
default_values <- df1 %>%
  group_by(Id) %>%
  slice(1) %>%
  rename(price0 = price, date0 = date) %>%
  ungroup()

df2 <- df2 %>%
  left_join(default_values, by = "Id") %>%
  mutate(price = if_else(is.na(price), price0, price),
         date = if_else(is.na(date), date0, date)) %>%
  select(Id, price, date)
Then join:
df1 %>%
  left_join(df2, by = "Id") %>%
  # keep only modifications dated no later than 30 days after the df1 date
  filter(date.y <= date.x + 30) %>%
  group_by(Id, price.x, date.x) %>%
  arrange(date.y) %>%
  # keep the latest remaining modification for each df1 row
  slice(n()) %>%
  ungroup() %>%
  rename(initial_price = price.x, initial_date = date.x,
         true_price = price.y, true_date = date.y) %>%
  mutate(is_modification_price = (initial_price != true_price),
         is_modification_date = (initial_date != true_date))
# # A tibble: 3 x 7
# Id initial_price initial_date true_price true_date is_modification_price is_modification_date
# <dbl> <dbl> <date> <dbl> <date> <lgl> <lgl>
# 1 1 1000 2016-01-01 2000 2016-01-05 TRUE TRUE
# 2 1 2000 2016-09-01 3000 2016-09-03 TRUE TRUE
# 3 2 1000 2016-01-01 1000 2016-01-05 FALSE TRUE
Note that the left_join followed by filter in the last step could take too much memory, because the join first builds every df1/df2 pair that shares an Id. If that is the case, use the non-equi join functionality in data.table instead.
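The non-equi join mentioned here could look roughly like the sketch below. It is not from the original answer. It assumes df2 has already been through the preparation step, so the filled values are typed in by hand as df2_filled (a name used only for this illustration), and the helper column limit is also an assumption of the sketch. The join applies the 30-day condition while matching, so the full Id-level cross product is never materialised.

```r
library(data.table)

# df1 from the question, with dates parsed as Date
dt1 <- data.table(
  Id    = c(1, 1, 2),
  price = c(1000, 2000, 1000),
  date  = as.Date(c("2016-01-01", "2016-09-01", "2016-01-01"))
)

# Assumption: df2 after the NA-filling step of the answer, written out by hand
df2_filled <- data.table(
  Id    = c(1, 1, 1, 1, 1, 2, 2),
  price = c(1500, 1500, 2000, 2000, 3000, 1000, 1000),
  date  = as.Date(c("2016-01-01", "2016-01-03", "2016-01-05", "2016-09-02",
                    "2016-09-03", "2016-01-03", "2016-01-05"))
)

# Upper bound for an acceptable modification date (hypothetical helper column)
dt1[, limit := date + 30]

# Non-equi join: same Id, and modification date <= limit.
# x.* refers to columns of df2_filled, i.* to columns of dt1.
joined <- df2_filled[dt1,
  on = .(Id, date <= limit),
  nomatch = NULL,
  .(Id,
    initial_price = i.price,
    initial_date  = i.date,
    true_price    = x.price,
    true_date     = x.date)]

# Keep the latest modification for each original df1 row
setorder(joined, Id, initial_date, true_date)
result <- joined[, .SD[.N], by = .(Id, initial_price, initial_date)]

# Flag what actually changed
result[, `:=`(
  is_modification_price = initial_price != true_price,
  is_modification_date  = initial_date != true_date
)]

print(result)
```

On the example data this should give the same three rows as the dplyr version. The join still returns every match inside the 30-day window, so the saving comes from never building the pairs that the filter would have discarded.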
Upvotes: 1