storing average of some columns in a new column in R

Question

I have a data set of gene expressions in R as follows:

> head(sheet2)
  Gene.symbol log2FoldChange pvalue padj BEC141A BEC142A BEC167A BEC191A BEC199A TuBEC46A
1        SAA1          -6.37      0    0   64.00    2.00   56.00   18.00   21.00  8779.92
2     ADAMTS9          -4.58      0    0   21.00   23.00   50.00   16.00   11.00   223.00
3       TENM2          -5.24      0    0  139.00  147.00   51.00   10.00   45.00  2978.00
4        NRP2          -2.74      0    0  462.00  468.00 1008.00  134.00  233.00  3021.00
5        DPP4          -5.67      0    0   14.00   32.00    3.00    8.00    9.00   374.00
6       ZBTB1          -1.02      0    0  376.63  502.95  379.24  303.57  353.64   775.71
  TuBEC58A TuBEC60A TuBEC81A
1  1682.37   624.12    64.00
2  2202.00   234.00    30.00
3 10029.00   108.00   157.00
4  6621.00  2182.00  1802.00
5  2708.00    23.00     6.00
6  1304.89   697.78   792.81

I want to create the average and standard deviation for BEC and TuBEC in 4 different column for each row. I created the following loop for it:

for (i in 1:nrow(sheet2)){
  sheet2$BECaverages <-  colMeans(sheet2[i, c(5:9)])
}
for (i in 1:nrow(sheet2)){
  sheet2$TuBECaverage <-  colMeans(sheet2[i, c(10:13)])
}
#>    Error in `$<-.data.frame`(`*tmp*`, "TuBECaverage", value = c(TuBEC46A = 8779.92,  : 
#>    replacement has 4 rows, data has 35

This loop gives an error for the second column I'm trying to create. Can you help me fix it?

r2evans · Accepted Answer

dplyr

library(dplyr)
sheet2 %>%
  rowwise() %>%
  mutate(
    BECmu = mean(c_across(matches("^BEC[0-9]"))),
    BECsigma = sd(c_across(matches("^BEC[0-9]"))), 
    TuBECmu = mean(c_across(matches("^TuBEC[0-9]"))), 
    TuBECsigma = sd(c_across(matches("^TuBEC[0-9]")))
  ) %>%
  ungroup()
# A tibble: 6 x 17
#   Gene.symbol log2FoldChange pvalue  padj BEC141A BEC142A BEC167A BEC191A BEC199A TuBEC46A TuBEC58A TuBEC60A TuBEC81A BECmu BECsigma TuBECmu TuBECsigma
#                                                                   
# 1 SAA1                 -6.37      0     0     64       2      56      18      21     8780.    1682.     624.      64   32.2     26.5   2788.      4051.
# 2 ADAMTS9              -4.58      0     0     21      23      50      16      11      223     2202      234       30   24.2     15.2    672.      1024.
# 3 TENM2                -5.24      0     0    139     147      51      10      45     2978    10029      108      157   78.4     61.1   3318       4671.
# 4 NRP2                 -2.74      0     0    462     468    1008     134     233     3021     6621     2182     1802  461      338.    3406.      2203.
# 5 DPP4                 -5.67      0     0     14      32       3       8       9      374     2708       23        6   13.2     11.2    778.      1298.
# 6 ZBTB1                -1.02      0     0    377.    503.    379.    304.    354.     776.    1305.     698.     793. 383.      73.5    893.       278.

Note: if we instead use starts_with("BEC") (like I did at first), then the calculation of BECmu is correct, but then BECsigma mistakenly uses BECmu in its calculation of standard deviation, which is not what we want. I get around that by making sure we only use columns that are BEC followed by a number, what I'm inferring is a safe pattern given your data. If not, there are perhaps other ways to exclude BECmu from that, including creating a variable that does not start with BEC (at least initially ... it can be renamed later).

data.table

library(data.table)
as.data.table(sheet2)[, c("BECmu", "BECsigma") := .(mean(unlist(.SD)), sd(unlist(.SD))), 
    .SDcols = patterns("^BEC"), by = seq_len(nrow(sheet2))
  ][, c("TuBECmu", "TuBECsigma") := .(mean(unlist(.SD)), sd(unlist(.SD))),
    .SDcols = patterns("^TuBEC"), by = seq_len(nrow(sheet2))][]
#    Gene.symbol log2FoldChange pvalue  padj BEC141A BEC142A BEC167A BEC191A BEC199A TuBEC46A TuBEC58A TuBEC60A TuBEC81A   BECmu  BECsigma   TuBECmu TuBECsigma
#                                                                        
# 1:        SAA1          -6.37      0     0   64.00    2.00   56.00   18.00   21.00  8779.92  1682.37   624.12    64.00  32.200  26.53677 2787.6025  4050.8466
# 2:     ADAMTS9          -4.58      0     0   21.00   23.00   50.00   16.00   11.00   223.00  2202.00   234.00    30.00  24.200  15.15586  672.2500  1024.1271
# 3:       TENM2          -5.24      0     0  139.00  147.00   51.00   10.00   45.00  2978.00 10029.00   108.00   157.00  78.400  61.08028 3318.0000  4670.8009
# 4:        NRP2          -2.74      0     0  462.00  468.00 1008.00  134.00  233.00  3021.00  6621.00  2182.00  1802.00 461.000 338.44202 3406.5000  2202.6833
# 5:        DPP4          -5.67      0     0   14.00   32.00    3.00    8.00    9.00   374.00  2708.00    23.00     6.00  13.200  11.21160  777.7500  1297.9631
# 6:       ZBTB1          -1.02      0     0  376.63  502.95  379.24  303.57  353.64   775.71  1304.89   697.78   792.81 383.206  73.50789  892.7975   277.8243

base R

(Though this doesn't really feel awesome.)

sheet2 <- cbind(sheet2, t(apply(subset(sheet2, select = grepl("^BEC", colnames(sheet2))), 1, function(z) c(BECmu = mean(z), BECsigma = sd(z)))))
sheet2 <- cbind(sheet2, t(apply(subset(sheet2, select = grepl("^TuBEC", colnames(sheet2))), 1, function(z) c(TuBECmu = mean(z), TuBECsigma = sd(z)))))
sheet2
#   Gene.symbol log2FoldChange pvalue padj BEC141A BEC142A BEC167A BEC191A BEC199A TuBEC46A TuBEC58A TuBEC60A TuBEC81A   BECmu  BECsigma   TuBECmu TuBECsigma
# 1        SAA1          -6.37      0    0   64.00    2.00   56.00   18.00   21.00  8779.92  1682.37   624.12    64.00  32.200  26.53677 2787.6025  4050.8466
# 2     ADAMTS9          -4.58      0    0   21.00   23.00   50.00   16.00   11.00   223.00  2202.00   234.00    30.00  24.200  15.15586  672.2500  1024.1271
# 3       TENM2          -5.24      0    0  139.00  147.00   51.00   10.00   45.00  2978.00 10029.00   108.00   157.00  78.400  61.08028 3318.0000  4670.8009
# 4        NRP2          -2.74      0    0  462.00  468.00 1008.00  134.00  233.00  3021.00  6621.00  2182.00  1802.00 461.000 338.44202 3406.5000  2202.6833
# 5        DPP4          -5.67      0    0   14.00   32.00    3.00    8.00    9.00   374.00  2708.00    23.00     6.00  13.200  11.21160  777.7500  1297.9631
# 6       ZBTB1          -1.02      0    0  376.63  502.95  379.24  303.57  353.64   775.71  1304.89   697.78   792.81 383.206  73.50789  892.7975   277.8243

We could have used rowMeans for the *mu values, but there is not an immediate rowSDs ... it's easy enough to write that, but I'd likely start with apply(x, 1, sd), which is effectively what I'm doing here anyway.

The use of apply(sheet2, 1, ..) returns a matrix with 2 rows (not columns), so I transpose it to a 2-column return. It's still a matrix, but by using cbind where the first argument is sheet2 (a data.frame), the second argument is converted into frame columns, so "it works".

Data

sheet2 <- structure(list(Gene.symbol = c("SAA1", "ADAMTS9", "TENM2", "NRP2", "DPP4", "ZBTB1"), log2FoldChange = c(-6.37, -4.58, -5.24, -2.74, -5.67, -1.02), pvalue = c(0L, 0L, 0L, 0L, 0L, 0L), padj = c(0L, 0L, 0L, 0L, 0L, 0L), BEC141A = c(64, 21, 139, 462, 14, 376.63), BEC142A = c(2, 23, 147, 468, 32, 502.95), BEC167A = c(56, 50, 51, 1008, 3, 379.24), BEC191A = c(18, 16, 10, 134, 8, 303.57), BEC199A = c(21, 11, 45, 233, 9, 353.64), TuBEC46A = c(8779.92, 223, 2978, 3021, 374, 775.71), TuBEC58A = c(1682.37,  2202, 10029, 6621, 2708, 1304.89), TuBEC60A = c(624.12, 234, 108, 2182, 23, 697.78), TuBEC81A = c(64, 30, 157, 1802, 6, 792.81)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))

storing average of some columns in a new column in R

Answers (2)

dplyr

data.table

base R

Related Questions