JC3019
JC3019

Reputation: 383

Calculate quantiles over matrix columns

For the given matrix, I need to track quantile values per column via a grouped variable. Specifically, I want to group output by "Cohort Structure". Then for columns 1 to 5, I want to calculate 25th, mean and 75th percentiles per the grouped variable. This means my output matrix will be 9 x 5. i.e. 3 rows per Cohort Structure =1, 3 rows for Cohort Structure =2 and 3 rows for Cohort Structure =3, each corresponding to 25th mean and 75th percentiles.

example:

test.mat <- data.frame(matrix(nrow = 11, ncol =6))
colnames(test.mat)[[6]] = "Cohort Structure"
test.mat[,6]= c(1,1,1,1,1,1,2,2,3,3,3)
test.mat[1:11,4:5] <- rnorm(11*2,0,1)
test.mat[11, 5] <- NA
test.mat[1:3,1:3] <- rnorm(9,0,1)

           X1        X2          X3         X4          X5 Cohort Structure
1  0.09529937 1.0140776 -0.45203406 -0.6585827  0.57117571                1
2  0.94442513 0.5777710  0.08588911 -0.3674672  0.01383938                1
3  1.47881362 0.4370171 -0.37843416 -1.2634002  0.58010696                1
4          NA        NA          NA  0.2844687  0.83113773                1
5          NA        NA          NA  0.8661393  0.35947394                1
6          NA        NA          NA -1.3685556 -0.71297431                1
7          NA        NA          NA -1.0117586  0.27020197                2
8          NA        NA          NA -0.7746377  0.97250990                2
9          NA        NA          NA -1.4406549  0.05538031                3
10         NA        NA          NA -0.2303378 -0.61625365                3
11         NA        NA          NA -0.1837904          NA                3

Desired Output (Output Matrix):

For columns 1:3 and rows 3:9 output matrix will be NA. Column 1, Rows 1:3 will report 25th, mean, 75th percentile values for cohort structure =1. This process will be repeated for column 2 and column 3.

At columns 4 and 5 repeat the process of calculating 25th, mean and 75th quantile for each cohort structure. Calculate values excluding NAs.

quantile(test.mat[1:3,1], c(0.25,0.5,0.75))
quantile(test.mat[1:3,2], c(0.25,0.5,0.75))
quantile(test.mat[1:3,3], c(0.25,0.5,0.75))

would be the desired output for Output Matrix[1:3,1:3]

quantile(test.mat[1:6,4], c(0.25,0.5,0.75))

would yiled the desired output for output Matrix [1:3,4]

for my actual dataset i need the process to apply to a matrix with 100 columns

Upvotes: 2

Views: 941

Answers (1)

cddt
cddt

Reputation: 549

Using data.table, I believe this produces the correct output. There may be a cleaner way to write it.

library(data.table)
test.mat <- data.table(test.mat)
quantiles <- test.mat[, .(quantile(X1, c(0.25, 0.5, 0.75), na.rm = TRUE), 
                          quantile(X2, c(0.25, 0.5, 0.75), na.rm = TRUE), 
                          quantile(X3, c(0.25, 0.5, 0.75), na.rm = TRUE), 
                          quantile(X4, c(0.25, 0.5, 0.75), na.rm = TRUE), 
                          quantile(X5, c(0.25, 0.5, 0.75), na.rm = TRUE)), 
                       by = 'Cohort Structure']

And add some labels so we know which row we're looking at:

quantiles[, quantile := c(0.25, 0.5, 0.75)]

Output:

> quantiles
   Cohort Structure        V1         V2         V3         V4          V5 quantile
1:                1 -1.220385 -0.3937794 0.05349869  0.3436015 -0.76662468     0.25
2:                1 -1.127379  0.3001190 0.88924650  0.9198491  0.09188820     0.50
3:                1 -1.013713  0.4744223 1.04911208  1.3364680  0.90340622     0.75
4:                2        NA         NA         NA  0.2912628 -0.20866542     0.25
5:                2        NA         NA         NA  0.2968669 -0.07529148     0.50
6:                2        NA         NA         NA  0.3024710  0.05808246     0.75
7:                3        NA         NA         NA -1.0510155 -0.64431366     0.25
8:                3        NA         NA         NA -0.4571571 -0.24590377     0.50
9:                3        NA         NA         NA  0.1136005  0.15250612     0.75

Edit: An alternative, which works with an arbitrary number of columns is:

quantiles <- test.mat[ , lapply(.SD, quantile, c(0.25, 0.5, 0.75), na.rm = TRUE), by = 'Cohort Structure']
quantiles[, quantile := c(0.25, 0.5, 0.75)]

Output is still consistent:

> quantiles
   Cohort Structure         X1       X2         X3         X4          X5 quantile
1:                1 -0.7882032 1.026384 -1.1975511 -0.8922598 -0.14365438     0.25
2:                1 -0.5700479 1.053239 -0.7222268  0.4451031  0.03217004     0.50
3:                1  0.3405146 1.282465 -0.5917531  0.9224831  0.24087650     0.75
4:                2         NA       NA         NA  0.3324551  0.97672542     0.25
5:                2         NA       NA         NA  0.7927529  1.03910678     0.50
6:                2         NA       NA         NA  1.2530508  1.10148814     0.75
7:                3         NA       NA         NA -0.3269997  0.51067050     0.25
8:                3         NA       NA         NA  0.4094524  0.55328059     0.50
9:                3         NA       NA         NA  0.6502998  0.59589067     0.75

Upvotes: 1

Related Questions