Jesse Kerr
Jesse Kerr

Reputation: 341

R dplyr rowwise mutate

Good morning all, this is my first time posting on stack overflow. Thank you for any help!

I have 2 dataframes that I am using to analyze stock data. One data frame has dates among other information, we can call it df:

df1 <- tibble(Key = c('a','b','c'), i =11:13, date= ymd(20110101:20110103))

The second dataframe also has dates and other important information.

df2 <-tibble(Answer = c('a','d','e','b','f','c'), j =14:19, date= ymd(20150304:20150309))

Here is what I want to do: For each row in df1, I need to:

-Find the date in df2 where, for when df2$answer is the same as df1$key, it is the closest to that row's date in df1.

-Then extract information for another column in that row in df2, and put it in a new row in df1.

The code i tried:

df1 %>%
 group_by(Key, i) %>%
 mutate(
`New Column` =  df2$j[
  which.min(subset(df2$date, df2$Answer== Key) - date)])

This has the result:

Key       i date       `New Column`
1 a        11 2011-01-01           14
2 b        12 2011-01-02           14
3 c        13 2011-01-03           14

This is correct for the first row, a. In df2, the closest date is 2015-03-04, for which the value of j is in fact 14.

However, for the second row, Key=b, I want df2 to subset to only look at dates for rows where df2$Answer = b. Therefore, the date should be 2015-03-07, for which j =17.

Thank you for your help!

Jesse

Upvotes: 1

Views: 854

Answers (1)

Dave Gruenewald
Dave Gruenewald

Reputation: 5689

This should work:

library(dplyr)
df1 %>% 
  left_join(df2, by = c("Key" = "Answer")) %>% 
  mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>% 
  group_by(Key) %>% 
  arrange(date_diff) %>% 
  slice(1) %>% 
  ungroup()

We are first joining the two data frames with left_join. Yes, I'm aware there are possibly multiple dates for each Key, bear with me.

Next, we calculate (with mutate) the absolute value (abs) of the difference between the two dates date.x and date.y.

Now that we have this, we will group the data by Key using group_by. This will make sure that each distinct Key will be treated separately in subsequent calculations.

Since we've calculated the date_diff, we can now re-order (arrange) the data for each Key, with the smallest date_diff as first for each Key.

Finally, we are only interested in that first, smallest date_diff for each Key, so we can discard the rest using slice(1).

This pipeline gives us the following:

  Key       i date.x         j date.y     date_diff
  <chr> <int> <date>     <int> <date>     <time>   
1 a        11 2011-01-01    14 2015-03-04 131587200
2 b        12 2011-01-02    17 2015-03-07 131760000
3 c        13 2011-01-03    19 2015-03-09 131846400

Upvotes: 2

Related Questions