Reputation: 235
Here is a minimal working example of my data:
library(data.table)
df <- data.table(
  date      = as.Date(rep("1999-01-04", 8)),
  volume    = c(1000, 1000, 1200, 1250, 1200, 1300, 1250, 1200),
  cp_flag   = c("P", "C", "C", "P", "C", "C", "P", "P"),
  price_in  = c(10.1, 120.4, 100.3, 0.1, 90.2, 45.7, 99.1, 7.4),
  price_out = c(12.4, 122.1, 102.0, 0.6, 99.1, 48.1, 100.0, 8.1),
  dtm       = c(10, 10, 12, 12, 12, 15, 15, 12)
)
setorder(df, date, dtm, volume)
df
         date volume cp_flag price_in price_out dtm
1: 1999-01-04   1000       P     10.1      12.4  10
2: 1999-01-04   1000       C    120.4     122.1  10
3: 1999-01-04   1200       C    100.3     102.0  12
4: 1999-01-04   1200       P      7.4       8.1  12
5: 1999-01-04   1200       C     90.2      99.1  13
6: 1999-01-04   1250       P      0.1       0.6  12
7: 1999-01-04   1250       P     99.1     100.0  15
8: 1999-01-04   1300       C     45.7      48.1  15
My goal is now: for each date, I want to compute a custom function over all items that share the same 1) volume and 2) dtm (i.e. duration), depending on whether an item is a "C" or a "P" product, for example:
volume/10 + price_in[cp_flag == "C"] - price_out[cp_flag == "P"]
An additional layer of difficulty is that there may be a different number of "P"s and "C"s for each date/volume/dtm combination (see, for example, volume = 1200), which I want to handle as described below.
As output, I am looking for
         date volume dtm
1: 1999-01-04   1000  10
2: 1999-01-04   1200  12
3: 1999-01-04   1200  13
4: 1999-01-04   1250  12
5: 1999-01-04   1250  15
6: 1999-01-04   1300  15
with an additional column representing the result of the function above, where the rows are given by all date/volume/dtm combinations, treated in the following manner: if a group contains only "C"s (or only "P"s), use price_in[cp_flag == "C"] and price_out[cp_flag == "C"], i.e. the values of those same products, so that within such a group the result has the same length as the group in the original table. I believe this can be done efficiently with data.table methods, but I don't quite get it to work.
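To make the intended rule concrete, here is a plain R sketch of what should happen within a single date/volume/dtm group (group_value is just an illustrative name, and how to pair the prices when the numbers of "C"s and "P"s differ is exactly part of my question):

# Illustration of the intended per-group rule, not code I already have
group_value <- function(volume, cp_flag, price_in, price_out) {
  if (length(unique(cp_flag)) == 1L) {
    # only "C"s or only "P"s: use price_in and price_out of those same products
    volume[1]/10 + price_in - price_out
  } else {
    # both types present: price_in from the "C"s, price_out from the "P"s
    # (how to pair them when their counts differ is part of my question)
    volume[1]/10 + price_in[cp_flag == "C"] - price_out[cp_flag == "P"]
  }
}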
It seems natural to operate on the .SDs, so I first tried to inspect each subset via
df[, print(.SD), by = .(date, volume, dtm), .SDcols = c("price_in", "price_out", "volume", "cp_flag")]
This gives me all the wanted combinations:
   price_in price_out cp_flag
1:     10.1      12.4       P
2:    120.4     122.1       C
   price_in price_out cp_flag
1:    100.3     102.0       C
2:      7.4       8.1       P
   price_in price_out cp_flag
1:     90.2      99.1       C
   price_in price_out cp_flag
1:      0.1       0.6       P
   price_in price_out cp_flag
1:     99.1       100       P
   price_in price_out cp_flag
1:     45.7      48.1       C
But now I am not sure how to compute the custom function, i.e. check how many "C"s and "P"s there are in each group and then compute the formula above, volume/10 + price_in[cp_flag == "C"] - price_out[cp_flag == "P"], for all Cs and Ps. In case there are only Cs or only Ps, I want to use only their information, i.e. take price_in and price_out from the same product.
For the first part, I have tried something like
df[, lapply(.SD, function(x) x[cp_flag == "C", volume/10] + x[cp_flag == "C", price_in] - x[cp_flag == "P", price_out]), by = .(date, volume, dtm), .SDcols = c("price_in", "price_out", "volume", "cp_flag")]
but this fails, as I apparently misunderstand how to use a custom function in this case.
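(A per-group count of the flags alone is presumably straightforward via something like the line below, with n_C and n_P as purely illustrative names; it is combining such a count with the price logic that I cannot get right.)

# count "C"s and "P"s per date/volume/dtm group (illustrative column names)
df[, .(n_C = sum(cp_flag == "C"), n_P = sum(cp_flag == "P")), by = .(date, volume, dtm)]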
Question: How do I properly apply such a custom function to the subsets of a data.table with this additional case structure?
Note: I know this example looks quite complicated, and maybe I am in too deep already and have spent too much time trying to crack it, but I can't see an easier way to present my issue. If there is any further clarification I can give, please let me know. Any help is highly appreciated!
Upvotes: 1
Views: 286
Reputation: 66819
I guess this is it:
res = df[, {
    # which product types occur in this date/volume/dtm group?
    flags = sort(unique(cp_flag))
    n_flags = length(flags)
    if (n_flags == 1L)
        # only "C"s or only "P"s: keep price_in and price_out of those same products
        .(g = .GRP, price_in, price_out, flags = flags)
    else CJ(
        # both types present: all combinations of C price_in with P price_out
        g = .GRP,
        price_in = price_in[cp_flag == "C"],
        price_out = price_out[cp_flag == "P"],
        flags = toString(flags)
    )
}, by = .(date, volume, dtm)][, v := volume/10 + price_in - price_out][]
         date volume dtm g price_in price_out flags     v
1: 1999-01-04   1000  10 1    120.4      12.4  C, P 208.0
2: 1999-01-04   1200  12 2     90.2       8.1  C, P 202.1
3: 1999-01-04   1200  12 2    100.3       8.1  C, P 212.2
4: 1999-01-04   1250  12 3      0.1       0.6     P 124.5
5: 1999-01-04   1250  15 4     99.1     100.0     P 124.1
6: 1999-01-04   1300  15 5     45.7      48.1     C 127.6
I wouldn't say this is efficient, but at least the calculations are done in a vectorized way.
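If you don't need the helper columns g and flags afterwards, they can be dropped in place, e.g.:

# remove the helper columns used during the group-wise construction
res[, c("g", "flags") := NULL]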
Upvotes: 2