Wilkit
Wilkit

Reputation: 87

How to create means and s.d. columns with data.table (based on multiple conditions)

I have been able to create a table with the median mean and sd, based on several conditions, using the dplry package. To do so I prepped the data. It looks a bit sloppy to me and I am using a big dataset (>3gb) so I have been using data.table already.

I am wondering how I can create a similar table more efficiently using the data.table package, so, if it is possible to do this without prepping the data too much. That would be great as I need to do this for many other conditions as well. Btw, I checked out this post but it didn't add conditions.

This is what I have and need:

I need the mean of con by id and date= 1 year since first i or i2 = A04 or A01.

EDIT (because it wasn't running before):

DATA:

DT <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 113L, 113L, 113L, 113L, 551L, 551L), 
               i = c("D95", "F85", "A01", "A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", "D95"), 
               i2 = c("F15", "", "", "", "", "", "", "", "", "A01", "", "A01", ""), 
               date = c("2015-06-19", "2016-08-15", "2013-03-16", "2017-01-17", "2013-01-16", "2009-05-08", "2011-04-03", "2015-05-04",
                        "2011-04-04", "2017-08-04", "2011-05-24", "2013-11-04", "2013-05-04"), 
               con = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1",  "1", "1", "1"), 
               PS = c("0",  "0", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0")), 
          class = "data.frame", row.names = c(NA, -13L))

The following is what I have done so far (and which works):

I created the columns PS (= having either A01 or A04 in i or i2) and ds(=days since first A01 or A01 (days aren't correct here)) based on the data.

With the dplyr package I first made a data.table with only rows with only ds between -365&0.

j.ds <- subset(DT, ds >= -365 & ds < 0)

Than agregated to a table with sum of con per id, like so:

j <- j.ds %>%
  group_by(id) %>% 
  summarise(con = sum(con))

From there I made the desired table:

jP <- j %>%
  summarise(median = median(con), mean = mean(con), SD = sd(con))

Upvotes: 3

Views: 659

Answers (2)

linog
linog

Reputation: 6226

Since I cannot run your example, I give you an untested data.table translation of your dplyr code.

You can do the operations you were doing quite efficiently in data.table:

library(data.table)
setDT(jx)
j <- jx[ds >= -365 & ds < 0, lapply(.SD, sum), by = "id"]
jP <- j[, lapply(.SD, function(x) return(c(median(x), mean(x), sd(x)))]
jP[,'stat' := c("median","mean","sd")

You end-up with a long formatted dataframe.

Edit

If you have non-numeric columns, you must not sum them. You can control the columns used in .SD with .SDcols. For instance, in your example, the columns are character: to do a sum you need to convert them to numeric. For instance, if you are interested in con and PS columns, you should do:

cols <- c("con","PS")
j <- DT[,lapply(.SD, function(x) sum(as.numeric(x))), by = "id", .SDcols = cols]

With function(x) sum(as.numeric(x)) you create an anonymous function (equivalent to lambda functions in python that applies as.numeric and sum all elements in your subset of data (SD).

The anonymous function return(c(median(x), mean(x), sd(x))) follows the same logic

Upvotes: 1

rg255
rg255

Reputation: 4169

As far as I can tell you want the sum of con per id, for data where ds is between -365 and 0, then the mean, median and standard deviation of those sums. This can be done by chaining two expressions.

DT[ds >= -365 & ds < 0, sum(con), by = "id"][, .(mean(V1), median(V1), sd(V1))]

This first subsets your data, then sums con per id, then takes the mean etc. of the sums with a second expression

Upvotes: 2

Related Questions