Reputation: 7147
I have the following data:
# A tibble: 7,971 x 10
symbol date open high low close volume adjusted start_date end_date
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <date>
1 AAPL 2009-01-02 12.3 13.0 12.2 13.0 186503800 11.4 2009-07-31 2010-06-30
2 AAPL 2009-01-05 13.3 13.7 13.2 13.5 295402100 11.8 2009-07-31 2010-06-30
3 AAPL 2009-01-06 13.7 13.9 13.2 13.3 322327600 11.6 2009-07-31 2010-06-30
4 AAPL 2009-01-07 13.1 13.2 12.9 13.0 188262200 11.4 2009-07-31 2010-06-30
5 AAPL 2009-01-08 12.9 13.3 12.9 13.2 168375200 11.6 2009-07-31 2010-06-30
6 AAPL 2009-01-09 13.3 13.3 12.9 12.9 136711400 11.3 2009-07-31 2010-06-30
7 AAPL 2009-01-12 12.9 13.0 12.5 12.7 154429100 11.1 2009-07-31 2010-06-30
8 AAPL 2009-01-13 12.6 12.8 12.3 12.5 199599400 11.0 2009-07-31 2010-06-30
9 AAPL 2009-01-14 12.3 12.5 12.1 12.2 255416000 10.7 2009-07-31 2010-06-30
10 AAPL 2009-01-15 11.5 12.0 11.4 11.9 457908500 10.4 2009-07-31 2010-06-30
I am trying to group by symbol, start date and end date and take the difference between the first observation on the start date and the last observation on the end date. I just can't seem to get it working.
That is take the difference of the "close" on the start date and the "close" on the end date.
Any help would be great, thanks!
syms <- c("AAPL", "MSFT", "GOOG")
library(tidyquant)
data <- tq_get(syms)
data <- data %>%
mutate( start_date = paste(year(date %m+% months(6)), "07", "31", sep = "-"), # note this is the start_date for when we calculate the returns - we will have bought this portfolio on the 1st July but we get returns on the 31st
end_date = paste(year(date %m+% months(18)), "06", "30", sep = "-"),
start_date = as.Date(start_date),
end_date = as.Date(end_date))
My attempt...
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = diff(close))
EDIT:
I am trying to group by symbol
and then take start_date
- end_date
. So first, I should be grouping by symbol and filtering the date
column down to between the start_date
and end_date
values. i.e. I am only interested in the "close" price on the start_date
and end_date
days (which is fixed). Then just take the difference between the close price on the start_date
and end_date
. So most of the stock price data is useless here and I am only interested in the close on the start_date
and end_date
then take the difference between these two values.
Upvotes: 1
Views: 7538
Reputation: 389325
I think what you are looking for is to subtract first
and last
close
value for each group
library(dplyr)
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = first(close) - last(close))
# symbol start_date end_date diff
# <chr> <date> <date> <dbl>
# 1 AAPL 2009-07-31 2010-06-30 -7.38
# 2 AAPL 2010-07-31 2011-06-30 -15.5
# 3 AAPL 2011-07-31 2012-06-30 -12.5
# 4 AAPL 2012-07-31 2013-06-30 -34.4
# 5 AAPL 2013-07-31 2014-06-30 28.0
# 6 AAPL 2014-07-31 2015-06-30 -34.5
# 7 AAPL 2015-07-31 2016-06-30 -31.9
# 8 AAPL 2016-07-31 2017-06-30 31
# 9 AAPL 2017-07-31 2018-06-30 -48.1
#10 AAPL 2018-07-31 2019-06-30 -41.6
# … with 26 more rows
Another way to write it could be
data %>%
group_by(symbol, start_date, end_date) %>%
summarise(diff = close[1L] - close[n()])
Or it can be also done using base R aggregate
aggregate(close~symbol +start_date + end_date,data,function(x) x[1L] - x[length(x)])
Upvotes: 4
Reputation: 1114
You can take this approach...
# create df of unique symbol, start, and end date combos
df1 <- df %>% distinct(symbol,start_date,end_date)
# join original data that match the desired start/end dates
df1 <- df %>% select(start_close=close,symbol,start_date=date) %>% left_join(df1,.)
df1 <- df %>% select(end_close=close,symbol,end_date=date) %>% left_join(df1,.)
# find difference in close values
df1 %>% mutate(diff=end_close - start_close)
# A tibble: 36 x 6
symbol start_date end_date start_close end_close diff
<chr> <date> <date> <dbl> <dbl> <dbl>
1 AAPL 2009-07-31 2010-06-30 23.3 35.9 12.6
2 AAPL 2010-07-31 2011-06-30 NA 48.0 NA
3 AAPL 2011-07-31 2012-06-30 NA NA NA
4 AAPL 2012-07-31 2013-06-30 87.3 NA NA
5 AAPL 2013-07-31 2014-06-30 64.6 92.9 28.3
6 AAPL 2014-07-31 2015-06-30 95.6 125. 29.8
7 AAPL 2015-07-31 2016-06-30 121. 95.6 -25.7
8 AAPL 2016-07-31 2017-06-30 NA 144. NA
9 AAPL 2017-07-31 2018-06-30 149. NA NA
10 AAPL 2018-07-31 2019-06-30 190. NA NA
# ... with 26 more rows
There are NA
s since not every start/end date are in the original date column.
Upvotes: 2