How to create means and s.d. columns with data.table (based on multiple conditions)

Question

I have been able to create a table with the median mean and sd, based on several conditions, using the dplry package. To do so I prepped the data. It looks a bit sloppy to me and I am using a big dataset (>3gb) so I have been using data.table already.

I am wondering how I can create a similar table more efficiently using the data.table package, so, if it is possible to do this without prepping the data too much. That would be great as I need to do this for many other conditions as well. Btw, I checked out this post but it didn't add conditions.

This is what I have and need:

I need the mean of con by id and date= 1 year since first i or i2 = A04 or A01.

EDIT (because it wasn't running before):

DATA:

DT <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 113L, 113L, 113L, 113L, 551L, 551L), 
               i = c("D95", "F85", "A01", "A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", "D95"), 
               i2 = c("F15", "", "", "", "", "", "", "", "", "A01", "", "A01", ""), 
               date = c("2015-06-19", "2016-08-15", "2013-03-16", "2017-01-17", "2013-01-16", "2009-05-08", "2011-04-03", "2015-05-04",
                        "2011-04-04", "2017-08-04", "2011-05-24", "2013-11-04", "2013-05-04"), 
               con = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1",  "1", "1", "1"), 
               PS = c("0",  "0", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0")), 
          class = "data.frame", row.names = c(NA, -13L))

The following is what I have done so far (and which works):

I created the columns PS (= having either A01 or A04 in i or i2) and ds(=days since first A01 or A01 (days aren't correct here)) based on the data.

With the dplyr package I first made a data.table with only rows with only ds between -365&0.

j.ds <- subset(DT, ds >= -365 & ds < 0)

Than agregated to a table with sum of con per id, like so:

j <- j.ds %>%
  group_by(id) %>% 
  summarise(con = sum(con))

From there I made the desired table:

jP <- j %>%
  summarise(median = median(con), mean = mean(con), SD = sd(con))

rg255 · Accepted Answer

As far as I can tell you want the sum of con per id, for data where ds is between -365 and 0, then the mean, median and standard deviation of those sums. This can be done by chaining two expressions.

DT[ds >= -365 & ds < 0, sum(con), by = "id"][, .(mean(V1), median(V1), sd(V1))]

This first subsets your data, then sums con per id, then takes the mean etc. of the sums with a second expression

How to create means and s.d. columns with data.table (based on multiple conditions)

Answers (2)

Edit

Related Questions