Reputation: 341
Good morning all, this is my first time posting on stack overflow. Thank you for any help!
I have 2 dataframes that I am using to analyze stock data. One data frame has dates among other information, we can call it df:
df1 <- tibble(Key = c('a','b','c'), i =11:13, date= ymd(20110101:20110103))
The second dataframe also has dates and other important information.
df2 <-tibble(Answer = c('a','d','e','b','f','c'), j =14:19, date= ymd(20150304:20150309))
Here is what I want to do: For each row in df1, I need to:
-Find the date in df2 where, for when df2$answer is the same as df1$key, it is the closest to that row's date in df1.
-Then extract information for another column in that row in df2, and put it in a new row in df1.
The code i tried:
df1 %>%
group_by(Key, i) %>%
mutate(
`New Column` = df2$j[
which.min(subset(df2$date, df2$Answer== Key) - date)])
This has the result:
Key i date `New Column`
1 a 11 2011-01-01 14
2 b 12 2011-01-02 14
3 c 13 2011-01-03 14
This is correct for the first row, a
. In df2
, the closest date is 2015-03-04
, for which the value of j
is in fact 14
.
However, for the second row, Key=b
, I want df2
to subset to only look at dates for rows where df2$Answer = b
. Therefore, the date should be 2015-03-07
, for which j =17
.
Thank you for your help!
Jesse
Upvotes: 1
Views: 854
Reputation: 5689
This should work:
library(dplyr)
df1 %>%
left_join(df2, by = c("Key" = "Answer")) %>%
mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
group_by(Key) %>%
arrange(date_diff) %>%
slice(1) %>%
ungroup()
We are first joining the two data frames with left_join
. Yes, I'm aware there are possibly multiple dates for each Key
, bear with me.
Next, we calculate (with mutate
) the absolute value (abs
) of the difference between the two dates date.x
and date.y
.
Now that we have this, we will group the data by Key
using group_by
. This will make sure that each distinct Key
will be treated separately in subsequent calculations.
Since we've calculated the date_diff
, we can now re-order (arrange
) the data for each Key
, with the smallest date_diff
as first for each Key
.
Finally, we are only interested in that first, smallest date_diff
for each Key
, so we can discard the rest using slice(1)
.
This pipeline gives us the following:
Key i date.x j date.y date_diff
<chr> <int> <date> <int> <date> <time>
1 a 11 2011-01-01 14 2015-03-04 131587200
2 b 12 2011-01-02 17 2015-03-07 131760000
3 c 13 2011-01-03 19 2015-03-09 131846400
Upvotes: 2