astrae_research
astrae_research

Reputation: 83

By group: sum of variable values under condition

Sum of var values by group with certain values excluded conditioned on the other variable. How to do it elegantly without transposing? So in the table below for each (fTicker, DATE_f), I seek to sum the values of wght with the value of wght conditioned on sTicker excluded from the sum.

In the table below, (excl_val,sTicker=A) |(fTicker=XLK, DATE_f = 6/20/2003) = wght_AAPL_6/20/2003_XLK + wght_AA_6/20/2003_XLK but not the wght for sTicker=A

+---------+---------+-----------+-------------+-------------+
| sTicker | fTicker |  DATE_f   |    wght     |  excl_val   |
+---------+---------+-----------+-------------+-------------+
| A       | XLK     | 6/20/2003 | 0.087600002 | 1.980834016 |
| A       | XLK     | 6/23/2003 | 0.08585     | 1.898560068 |
| A       | XLK     | 6/24/2003 | 0.085500002 |             |
| AAPL    | XLK     | 6/20/2003 | 0.070080002 |             |
| AAPL    | XLK     | 6/23/2003 | 0.06868     |             |
| AAPL    | XLK     | 6/24/2003 | 0.068400002 |             |
| AA      | XLK     | 6/20/2003 | 1.910754014 |             |
| AA      | XLK     | 6/23/2003 | 1.829880067 |             |
| AA      | XLK     | 6/24/2003 | 1.819775    |             |
|         |         |           |             |             |
|         |         |           |             |             |
+---------+---------+-----------+-------------+-------------+

There are several fTicker groups with many sTicker in them (10 to 70), some sTicker may belong to several fTicker. The end result should be an excl_val for each sTicker on each DATE_f and for each fTicker.

I did it by transposing in SAS with resulting file about 6 gb but the same approach in R, blew memory up to 40 gb and it's basically unworkable.

In R, I got as far as this

weights$excl_val  <- with(weights, aggregate(wght, list(fTicker, DATE_f), sum, na.rm=T))

but it's just a simple sum (without excluding the necessary observation) and there is mismatch between rows length. If i could condition the sum to exclude the sTicker obs for wght from the summation, i think it might work.

About the excl_val length: i computed it in excel, for just 2 cells, that's why it's short

Thank you!

Arsenio

Upvotes: 2

Views: 2377

Answers (1)

Vincent Zoonekynd
Vincent Zoonekynd

Reputation: 32351

When you have data in a data.frame, it is better if the rows are meaningful (in particular, the columns should have the same length): in this case, excl_val looks like a separate vector. After putting the information it contains in the data.frame, things become easier.

# Sample data
k <- 5
d <- data.frame(
  sTicker = rep(LETTERS[1:k], k),
  fTicker = rep(LETTERS[1:k], each=k),
  DATE_f = sample( seq(Sys.Date(), length=2, by=1), k*k, replace=TRUE ),
  wght = runif(k*k)
)
excl_val <- sample(d$wght, k)
# Add a "valid" column to the data.frame
d$valid <- ! d$wght %in% excl_val
# Compute the sum
library(plyr)
ddply(d, c("fTicker","DATE_f"), summarize, sum=sum(wght[valid]))

Upvotes: 1

Related Questions