Reputation: 383
For the given matrix, I need to track quantile values per column via a grouped variable. Specifically, I want to group output by "Cohort Structure". Then for columns 1 to 5, I want to calculate 25th, mean and 75th percentiles per the grouped variable. This means my output matrix will be 9 x 5. i.e. 3 rows per Cohort Structure =1, 3 rows for Cohort Structure =2 and 3 rows for Cohort Structure =3, each corresponding to 25th mean and 75th percentiles.
example:
test.mat <- data.frame(matrix(nrow = 11, ncol =6))
colnames(test.mat)[[6]] = "Cohort Structure"
test.mat[,6]= c(1,1,1,1,1,1,2,2,3,3,3)
test.mat[1:11,4:5] <- rnorm(11*2,0,1)
test.mat[11, 5] <- NA
test.mat[1:3,1:3] <- rnorm(9,0,1)
X1 X2 X3 X4 X5 Cohort Structure
1 0.09529937 1.0140776 -0.45203406 -0.6585827 0.57117571 1
2 0.94442513 0.5777710 0.08588911 -0.3674672 0.01383938 1
3 1.47881362 0.4370171 -0.37843416 -1.2634002 0.58010696 1
4 NA NA NA 0.2844687 0.83113773 1
5 NA NA NA 0.8661393 0.35947394 1
6 NA NA NA -1.3685556 -0.71297431 1
7 NA NA NA -1.0117586 0.27020197 2
8 NA NA NA -0.7746377 0.97250990 2
9 NA NA NA -1.4406549 0.05538031 3
10 NA NA NA -0.2303378 -0.61625365 3
11 NA NA NA -0.1837904 NA 3
Desired Output (Output Matrix):
For columns 1:3 and rows 3:9 output matrix will be NA. Column 1, Rows 1:3 will report 25th, mean, 75th percentile values for cohort structure =1. This process will be repeated for column 2 and column 3.
At columns 4 and 5 repeat the process of calculating 25th, mean and 75th quantile for each cohort structure. Calculate values excluding NAs.
quantile(test.mat[1:3,1], c(0.25,0.5,0.75))
quantile(test.mat[1:3,2], c(0.25,0.5,0.75))
quantile(test.mat[1:3,3], c(0.25,0.5,0.75))
would be the desired output for Output Matrix[1:3,1:3]
quantile(test.mat[1:6,4], c(0.25,0.5,0.75))
would yiled the desired output for output Matrix [1:3,4]
for my actual dataset i need the process to apply to a matrix with 100 columns
Upvotes: 2
Views: 941
Reputation: 549
Using data.table
, I believe this produces the correct output. There may be a cleaner way to write it.
library(data.table)
test.mat <- data.table(test.mat)
quantiles <- test.mat[, .(quantile(X1, c(0.25, 0.5, 0.75), na.rm = TRUE),
quantile(X2, c(0.25, 0.5, 0.75), na.rm = TRUE),
quantile(X3, c(0.25, 0.5, 0.75), na.rm = TRUE),
quantile(X4, c(0.25, 0.5, 0.75), na.rm = TRUE),
quantile(X5, c(0.25, 0.5, 0.75), na.rm = TRUE)),
by = 'Cohort Structure']
And add some labels so we know which row we're looking at:
quantiles[, quantile := c(0.25, 0.5, 0.75)]
Output:
> quantiles
Cohort Structure V1 V2 V3 V4 V5 quantile
1: 1 -1.220385 -0.3937794 0.05349869 0.3436015 -0.76662468 0.25
2: 1 -1.127379 0.3001190 0.88924650 0.9198491 0.09188820 0.50
3: 1 -1.013713 0.4744223 1.04911208 1.3364680 0.90340622 0.75
4: 2 NA NA NA 0.2912628 -0.20866542 0.25
5: 2 NA NA NA 0.2968669 -0.07529148 0.50
6: 2 NA NA NA 0.3024710 0.05808246 0.75
7: 3 NA NA NA -1.0510155 -0.64431366 0.25
8: 3 NA NA NA -0.4571571 -0.24590377 0.50
9: 3 NA NA NA 0.1136005 0.15250612 0.75
Edit: An alternative, which works with an arbitrary number of columns is:
quantiles <- test.mat[ , lapply(.SD, quantile, c(0.25, 0.5, 0.75), na.rm = TRUE), by = 'Cohort Structure']
quantiles[, quantile := c(0.25, 0.5, 0.75)]
Output is still consistent:
> quantiles
Cohort Structure X1 X2 X3 X4 X5 quantile
1: 1 -0.7882032 1.026384 -1.1975511 -0.8922598 -0.14365438 0.25
2: 1 -0.5700479 1.053239 -0.7222268 0.4451031 0.03217004 0.50
3: 1 0.3405146 1.282465 -0.5917531 0.9224831 0.24087650 0.75
4: 2 NA NA NA 0.3324551 0.97672542 0.25
5: 2 NA NA NA 0.7927529 1.03910678 0.50
6: 2 NA NA NA 1.2530508 1.10148814 0.75
7: 3 NA NA NA -0.3269997 0.51067050 0.25
8: 3 NA NA NA 0.4094524 0.55328059 0.50
9: 3 NA NA NA 0.6502998 0.59589067 0.75
Upvotes: 1