subset data.table columns for calculation of new value but keep all old columns + new value

Question

I had a hard time phrasing the title of the question, so if someone can suggest a more accurate title, I would be grateful.

I have a matrix of ~100 samples and readings at ~200,000 specific locations. The locations are string identifiers, the sample readings are numeric. Example data:

library(data.table)
str_col = c("str1", "str2", "str3")
s1 = c(54.4, 19.0, 89.0)
s2 = c(46.6, 39.5, 85.2)
s3 = c(12.3, 0.2, 55.8)

dt = data.table(str_col, s1, s2, s3)
dt
   str_col   s1   s2   s3
1:    str1 54.4 46.6 12.3
2:    str2 19.0 39.5  0.2
3:    str3 89.0 85.2 55.8

I would like to calculate the standard deviation of the readings at each location (rows) for the sample columns (s1 - s3) while leaving out the location identifier str_col.

My attempt at this was:

dt[, -1, with=FALSE][, stdv := apply(.SD, 1, sd)]
dt
   str_col   s1   s2   s3
1:    str1 54.4 46.6 12.3
2:    str2 19.0 39.5  0.2
3:    str3 89.0 85.2 55.8

however this just returned the original data.table, as you can see.

I can do my desired operation in steps as follows:

dt_str_col = dt[,.(str_col)]
dt2 = dt[, -1, with=FALSE][, stdv := apply(.SD, 1, sd)]
dt3 = data.table(dt_str_col, dt2)
dt3
   str_col   s1   s2   s3     stdv
1:    str1 54.4 46.6 12.3 22.39695
2:    str2 19.0 39.5  0.2 19.65613
3:    str3 89.0 85.2 55.8 18.17067

But I was wondering if there was a way to do this by reference in data.table, similar to my first attempt dt[, -1, with=FALSE][, stdv := apply(.SD, 1, sd)] ?

rafa.pereira · Accepted Answer

I believe this would solve your problem, wouldn't it ?

dt[ ,  sdd := sd(.SD), by = str_col]

dt
#>    str_col   s1   s2   s3  sdd
#> 1:    str1 54.4 46.6 12.3 22.4
#> 2:    str2 19.0 39.5  0.2 19.7
#> 3:    str3 89.0 85.2 55.8 18.2

In case there is more than one row per str_cols (i.e. you really want to calculate the standard deviation by rows), you can do this:

# create a column with row positions
dt[, rowpos := .I]

dt[ ,  sdd := sd(.SD[, -1, with=FALSE]), by = rowpos]

subset data.table columns for calculation of new value but keep all old columns + new value

Answers (2)

Related Questions