Reputation: 87
I have been able to create a table with the median
mean
and sd
, based on several conditions, using the dplry
package. To do so I prepped the data. It looks a bit sloppy to me and I am using a big dataset (>3gb) so I have been using data.table
already.
I am wondering how I can create a similar table more efficiently using the data.table
package, so, if it is possible to do this without prepping the data too much. That would be great as I need to do this for many other conditions as well. Btw, I checked out this post but it didn't add conditions.
This is what I have and need:
I need the mean
of con
by id
and date
= 1 year since first i
or i2
= A04
or A01
.
EDIT (because it wasn't running before):
DATA:
DT <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 113L, 113L, 113L, 113L, 551L, 551L),
i = c("D95", "F85", "A01", "A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", "D95"),
i2 = c("F15", "", "", "", "", "", "", "", "", "A01", "", "A01", ""),
date = c("2015-06-19", "2016-08-15", "2013-03-16", "2017-01-17", "2013-01-16", "2009-05-08", "2011-04-03", "2015-05-04",
"2011-04-04", "2017-08-04", "2011-05-24", "2013-11-04", "2013-05-04"),
con = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
PS = c("0", "0", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0")),
class = "data.frame", row.names = c(NA, -13L))
The following is what I have done so far (and which works):
I created the columns PS
(= having either A01
or A04
in i
or i2
) and ds
(=days since first A01
or A01
(days aren't correct here)) based on the data.
With the dplyr
package I first made a data.table
with only rows with only ds
between -365&0.
j.ds <- subset(DT, ds >= -365 & ds < 0)
Than agregated to a table with sum
of con
per id
, like so:
j <- j.ds %>%
group_by(id) %>%
summarise(con = sum(con))
From there I made the desired table:
jP <- j %>%
summarise(median = median(con), mean = mean(con), SD = sd(con))
Upvotes: 3
Views: 659
Reputation: 6226
Since I cannot run your example, I give you an untested data.table
translation of your dplyr
code.
You can do the operations you were doing quite efficiently in data.table
:
library(data.table)
setDT(jx)
j <- jx[ds >= -365 & ds < 0, lapply(.SD, sum), by = "id"]
jP <- j[, lapply(.SD, function(x) return(c(median(x), mean(x), sd(x)))]
jP[,'stat' := c("median","mean","sd")
You end-up with a long formatted dataframe.
If you have non-numeric columns, you must not sum them. You can control the columns used in .SD
with .SDcols
. For instance, in your example, the columns are character: to do a sum you need to convert them to numeric. For instance, if you are interested in con
and PS
columns, you should do:
cols <- c("con","PS")
j <- DT[,lapply(.SD, function(x) sum(as.numeric(x))), by = "id", .SDcols = cols]
With function(x) sum(as.numeric(x))
you create an anonymous function (equivalent to lambda
functions in python
that applies as.numeric and sum all elements in your subset of data
(SD).
The anonymous function return(c(median(x), mean(x), sd(x)))
follows the same logic
Upvotes: 1
Reputation: 4169
As far as I can tell you want the sum of con
per id
, for data where ds
is between -365 and 0, then the mean, median and standard deviation of those sums. This can be done by chaining two expressions.
DT[ds >= -365 & ds < 0, sum(con), by = "id"][, .(mean(V1), median(V1), sd(V1))]
This first subsets your data, then sums con
per id
, then takes the mean etc. of the sums with a second expression
Upvotes: 2