user113156
user113156

Reputation: 7147

taking the difference between two dates dplyr

I have the following data:

# A tibble: 7,971 x 10
   symbol date        open  high   low close    volume adjusted start_date end_date  
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl> <date>     <date>    
 1 AAPL   2009-01-02  12.3  13.0  12.2  13.0 186503800     11.4 2009-07-31 2010-06-30
 2 AAPL   2009-01-05  13.3  13.7  13.2  13.5 295402100     11.8 2009-07-31 2010-06-30
 3 AAPL   2009-01-06  13.7  13.9  13.2  13.3 322327600     11.6 2009-07-31 2010-06-30
 4 AAPL   2009-01-07  13.1  13.2  12.9  13.0 188262200     11.4 2009-07-31 2010-06-30
 5 AAPL   2009-01-08  12.9  13.3  12.9  13.2 168375200     11.6 2009-07-31 2010-06-30
 6 AAPL   2009-01-09  13.3  13.3  12.9  12.9 136711400     11.3 2009-07-31 2010-06-30
 7 AAPL   2009-01-12  12.9  13.0  12.5  12.7 154429100     11.1 2009-07-31 2010-06-30
 8 AAPL   2009-01-13  12.6  12.8  12.3  12.5 199599400     11.0 2009-07-31 2010-06-30
 9 AAPL   2009-01-14  12.3  12.5  12.1  12.2 255416000     10.7 2009-07-31 2010-06-30
10 AAPL   2009-01-15  11.5  12.0  11.4  11.9 457908500     10.4 2009-07-31 2010-06-30

I am trying to group by symbol, start date and end date and take the difference between the first observation on the start date and the last observation on the end date. I just can't seem to get it working.

That is take the difference of the "close" on the start date and the "close" on the end date.

Any help would be great, thanks!

syms <- c("AAPL", "MSFT", "GOOG")

library(tidyquant)
data <- tq_get(syms)


data <- data %>%
  mutate(         start_date = paste(year(date %m+% months(6)), "07", "31", sep = "-"), # note this is the start_date for when we calculate the returns - we will have bought this portfolio on the 1st July but we get returns on the 31st
                  end_date = paste(year(date %m+% months(18)), "06", "30", sep = "-"),
                  start_date = as.Date(start_date),
                  end_date = as.Date(end_date))

My attempt...

data %>%
  group_by(symbol, start_date, end_date) %>%
  summarise(diff = diff(close))

EDIT:

I am trying to group by symbol and then take start_date - end_date. So first, I should be grouping by symbol and filtering the date column down to between the start_date and end_date values. i.e. I am only interested in the "close" price on the start_date and end_date days (which is fixed). Then just take the difference between the close price on the start_date and end_date. So most of the stock price data is useless here and I am only interested in the close on the start_date and end_date then take the difference between these two values.

Upvotes: 1

Views: 7538

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389325

I think what you are looking for is to subtract first and last close value for each group

library(dplyr)

data %>%
  group_by(symbol, start_date, end_date) %>%
  summarise(diff = first(close) - last(close))

#   symbol start_date end_date     diff
#   <chr>  <date>     <date>      <dbl>
# 1 AAPL   2009-07-31 2010-06-30  -7.38
# 2 AAPL   2010-07-31 2011-06-30 -15.5 
# 3 AAPL   2011-07-31 2012-06-30 -12.5 
# 4 AAPL   2012-07-31 2013-06-30 -34.4 
# 5 AAPL   2013-07-31 2014-06-30  28.0 
# 6 AAPL   2014-07-31 2015-06-30 -34.5 
# 7 AAPL   2015-07-31 2016-06-30 -31.9 
# 8 AAPL   2016-07-31 2017-06-30  31   
# 9 AAPL   2017-07-31 2018-06-30 -48.1 
#10 AAPL   2018-07-31 2019-06-30 -41.6 
# … with 26 more rows

Another way to write it could be

data %>% 
  group_by(symbol, start_date, end_date) %>% 
  summarise(diff = close[1L] - close[n()])

Or it can be also done using base R aggregate

aggregate(close~symbol +start_date + end_date,data,function(x) x[1L] - x[length(x)])

Upvotes: 4

kstew
kstew

Reputation: 1114

You can take this approach...

# create df of unique symbol, start, and end date combos
df1 <- df %>% distinct(symbol,start_date,end_date)

# join original data that match the desired start/end dates
df1 <- df %>% select(start_close=close,symbol,start_date=date) %>% left_join(df1,.)
df1 <- df %>% select(end_close=close,symbol,end_date=date) %>% left_join(df1,.)

# find difference in close values
df1 %>% mutate(diff=end_close - start_close)

# A tibble: 36 x 6
   symbol start_date end_date   start_close end_close  diff
   <chr>  <date>     <date>           <dbl>     <dbl> <dbl>
 1 AAPL   2009-07-31 2010-06-30        23.3      35.9  12.6
 2 AAPL   2010-07-31 2011-06-30        NA        48.0  NA  
 3 AAPL   2011-07-31 2012-06-30        NA        NA    NA  
 4 AAPL   2012-07-31 2013-06-30        87.3      NA    NA  
 5 AAPL   2013-07-31 2014-06-30        64.6      92.9  28.3
 6 AAPL   2014-07-31 2015-06-30        95.6     125.   29.8
 7 AAPL   2015-07-31 2016-06-30       121.       95.6 -25.7
 8 AAPL   2016-07-31 2017-06-30        NA       144.   NA  
 9 AAPL   2017-07-31 2018-06-30       149.       NA    NA  
10 AAPL   2018-07-31 2019-06-30       190.       NA    NA  
# ... with 26 more rows

There are NAs since not every start/end date are in the original date column.

Upvotes: 2

Related Questions