Reputation: 4716
I have some (financial time-series) data in a tall format:
require(data.table)
DT <- data.table(Variable=c(rep("a",times = 3), rep("b", times=3)),
Date=as.Date(c("2014-04-01","2014-04-02","2014-04-03"
,"2014-04-02", "2014-04-03","2014-04-04")),
Value=c(1:3,3:1), key=c("Variable","Date"))
DT
Variable Date Value
1: a 2014-04-01 1
2: a 2014-04-02 2
3: a 2014-04-03 3
4: b 2014-04-02 3
5: b 2014-04-03 2
6: b 2014-04-04 1
I would like to calculate a third variable, spread, where spread = a - b for each Date common to both series (basically the spread between two time series, a common transformation in the financial domain).
Desired output:
Variable Date Value
1: spread 2014-04-02 -1
2: spread 2014-04-03 1
I know of a solution where I dcast.data.table the data into wide format (i.e. into a table with columns c("Date", "a", "b")), but due to performance issues on large data, is there an elegant way to do this directly in the tall format using a) dplyr and b) data.table (a two-part question)?
Ideally, on the dplyr side, I am looking for something as expressive as mutate(tbl_dt(DT, tall=TRUE), spread=a-b). (Disclaimer: I am a complete newbie in dplyr.)
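For reference, here is a minimal sketch of the wide-format route mentioned above (assuming Value is the value column; the last line only brings the result back to the tall layout):
# wide-format baseline I'd like to avoid (sketch): cast to wide, subtract, return to tall layout
wide <- dcast.data.table(DT, Date ~ Variable, value.var = "Value")
wide[!is.na(a) & !is.na(b), list(Variable = "spread", Date, Value = a - b)]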
Real-life dataset:
# download 200 stocks from Quandl.com. requires free registration
library(Quandl); library(data.table); library(plyr)
ntickers <- 200 ; auth.token="register_free_to_obtain_token"
code.file <- tempfile()
download.file("https://s3.amazonaws.com/quandl-static-content/quandl-stock-code-list.csv",
destfile=code.file)
tickers <- na.omit(read.csv2(code.file, sep=",", stringsAsFactors=FALSE)[,"Price.Code"])
lst <- na.omit(tickers)[1:ntickers]
names(lst) <- lst
Q <- ldply(lst, Quandl,
type = "raw", end_date="2014-04-08",
sort="asc", auth=auth.token) # might take minutes
DT <- as.data.table(Q)[,Date:=as.IDate(Date)]
setnames(DT, ".id", "Instrument")
setkey(DT, Instrument, Date)
> dim(DT); object.size(DT)
[1] 685512 8
41145752 bytes
> DT
Instrument Date Open High Low Close Volume Adjusted Close
1: GOOG/AMEX_ABI 1981-03-11 NA NA 6.56 6.75 217200 NA
2: GOOG/AMEX_ABI 1981-03-12 NA NA 6.66 6.88 616400 NA
3: GOOG/AMEX_ABI 1981-03-13 NA NA 6.81 6.84 462000 NA
4: GOOG/AMEX_ABI 1981-03-16 NA NA 6.81 7.00 306400 NA
5: GOOG/AMEX_ABI 1981-03-17 NA NA 6.88 6.88 925600 NA
---
685508: YAHOO/TSX_AHX_TO 2014-04-02 0.75 0.75 0.75 0.75 5000 0.75
685509: YAHOO/TSX_AHX_TO 2014-04-03 0.79 0.82 0.75 0.82 25700 0.82
685510: YAHOO/TSX_AHX_TO 2014-04-04 0.81 0.81 0.78 0.80 4500 0.80
685511: YAHOO/TSX_AHX_TO 2014-04-07 0.80 1.05 0.80 0.96 40400 0.96
685512: YAHOO/TSX_AHX_TO 2014-04-08 0.95 0.96 0.90 0.95 21300 0.95
Upvotes: 2
Views: 613
Reputation: 1756
Here's a data.table method using dcast.data.table.
I hope this provides a useful start; there are some remaining issues around selecting NAs and speed gains.
# Create Dataset
require(data.table)
require(reshape2)
DT <- data.table(Variable=c(rep("a",times = 3), rep("b", times=3)),
Date=as.Date(c("2014-04-01","2014-04-02","2014-04-03"
,"2014-04-02", "2014-04-03","2014-04-04")),
Value=c(1:3,3:1), key=c("Variable","Date"))
# using data.table
DT2 <- dcast.data.table(DT, Date ~ Variable, drop=FALSE)
DT2[, spread:= a-b, by = Date][!is.na(spread),]
# Actually I'm not clear about the difference between `drop = FALSE` and `drop = TRUE`
This is the output
         Date a b spread
1: 2014-04-02 2 3     -1
2: 2014-04-03 3 2      1
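To match the tall Variable/Date/Value layout asked for in the question, a small follow-up sketch on DT2 (using the column names above):
# reshape the wide result back into the question's tall layout (sketch)
DT2[!is.na(spread), list(Variable = "spread", Date, Value = spread)]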
Upvotes: 2
Reputation: 103938
Here's one approach with dplyr. First we create the data:
require(dplyr)
df <- data.frame(
Variable = rep(c("a", "b"), each = 3),
Date = rep(as.Date("2014-04-01") + 0:2, 2),
Value = c(1:3, 3:1)
)
Instead of rotating into a wide form, we can use a vectorised subtraction within each group:
df %.%
group_by(Date) %.%
summarise(spread = Value[Variable == "a"] - Value[Variable == "b"])
## Source: local data frame [3 x 2]
##
## Date spread
## 1 2014-04-01 -2
## 2 2014-04-02 0
## 3 2014-04-03 2
This will correctly fail if there are multiple values of a or b, because summarise() requires that results are of length one. The same approach would work with data.table, but you'd need to be a little more careful about checking the results (because data.table is less strict/more flexible here compared to dplyr).
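A hedged sketch of that grouped difference on the data.table side, with an explicit guard instead of relying on recycling (it assumes at most one row per Variable per Date):
# keep only dates that have both an "a" and a "b" row; groups where j returns NULL are dropped
as.data.table(df)[, if (.N == 2L) list(spread = Value[Variable == "a"] - Value[Variable == "b"]),
                  by = Date]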
You could also use the join approach suggested by BondedDust. It's not quite as convenient with dplyr as it is with data.table:
a <- df %.% filter(Variable == "a") %.% select(-Variable)
b <- df %.% filter(Variable == "b") %.% select(-Variable)
inner_join(a, b, by = "Date") %.%
mutate(spread = Value.x - Value.y)
## Date Value.x Value.y spread
## 1 2014-04-01 1 3 -2
## 2 2014-04-02 2 2 0
## 3 2014-04-03 3 1 2
Upvotes: 3
Reputation: 263441
You ought to be able to work with this:
> merge(DT["a", ], DT["b",], by="Date")
Date Variable.x Value.x Variable.y Value.y
1: 2014-04-02 a 2 b 3
2: 2014-04-03 a 3 b 2
The help page for merge.data.table suggests you read FAQ 1.12 for a detailed comparison of this approach with the X[Y, ...] join idiom.
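For completeness, a sketch of what that X[Y, ...] form might look like here (a and b are keyed single-variable subsets, names chosen for illustration; i.Value refers to the Value column of the table supplied as i):
# keyed join of the "b" rows onto the "a" rows, computing the spread on common dates only
a <- DT["a"][, list(Date, Value)]; setkey(a, Date)
b <- DT["b"][, list(Date, Value)]; setkey(b, Date)
a[b, list(Variable = "spread", Date, Value = Value - i.Value), nomatch = 0L]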
Upvotes: 4