Finding the Highest Frequencies in multiple columns

Question

I have a data_frame containing 10 columns and 2000 rows. My sample data would look like:

rs_id   Code    Combination_Ag  A.Ag    Combination_Bg  B.Ag    Combination_Cg  C.Ag
rs_1    0   1:01/1:01   1   13:02/13:02 1   03:04/03:04 6   1:01/1:01   1
rs_1    0   1:01/11:01  2   13:02/49:01 2   03:04/15:02 1   1:01/15:01  1
rs_1    1   1:01/2:01   6   13:02/57:01 1   03:04/7:01  2   1:01/3:01   1
rs_1    2   1:01/2:05:  1   13:02/8:01  1   06:02/06:02 3   1:01/4:04   1
rs_1    2   1:01/24:02  3   14:01/14:02 1   06:02/15:02 1   1:01/4:04   3
rs_2    0   1:01/3:01   1   14:01/7:02  1   06:02/2:02: 1   1:01/4:07   1
rs_2    1   1:01/31:01  1   15:01/15:01 1   06:02/3:03  1   1:01/7:01   2
rs_2    1   11:01/2:01  4   15:01/18:01 1   06:02/3:04  1   10:01/14:01 1
rs_2    2   11:01/25:01 1   15:01/44:02 2   06:02/4:01  1   10:01/3:01  5

I am trying to find the highest combination (A.Ag, B.Bg C.Ag) for rs_id =0, 1 and 2. How can I achieve this? The output would be

rs_1    0   1:01/11:01   2   13:02/49:01    2   03:04/03:04 6   1:01/1:01   1
rs_1    1   1:01/2:01    6   13:02/57:01    1   03:04/7:01  2   1:01/3:01   1
rs_1    2   1:01/24:02   3   06:02/06:02    3   06:02/15:02 1   1:01/4:04   3
rs_2    0   1:01/3:01    1   14:01/7:02     1   06:02/2:02: 1   1:01/4:07   1
rs_2    1   11:01/2:01   4   15:01/18:01    1   06:02/3:04  1   10:01/14:01 1 
rs_2    2   11:01/25:01  1   15:01/44:02    2   06:02/4:01  1   10:01/3:01  5

Uwe · Accepted Answer

This approach reshapes the data from wide to long format (melting two measure columns simultaneously), picks the row with the top Ag value for each unique combination of rs_id, Code, and variable. Finally, the result is reshaped from long to wide format again with the column order rearranged to return the expected result:

library(data.table)
cols <- c("Combination", "Ag")
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"), 
     value.name = cols)[
       , variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
         , .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
           , dcast(.SD, rs_id + Code ~ variable, value.var = cols)][
             , setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+"))))]

   rs_id Code Combination_A Ag_A Combination_B Ag_B Combination_C Ag_C Combination_D Ag_D
1:  rs_1    0    1:01/11:01    2   13:02/49:01    2   03:04/03:04    6     1:01/1:01    1
2:  rs_1    1     1:01/2:01    6   13:02/57:01    1    03:04/7:01    2     1:01/3:01    1
3:  rs_1    2    1:01/24:02    3    13:02/8:01    1   06:02/06:02    3     1:01/4:04    3
4:  rs_2    0     1:01/3:01    1    14:01/7:02    1   06:02/2:02:    1     1:01/4:07    1
5:  rs_2    1    11:01/2:01    4   15:01/15:01    1    06:02/3:03    1     1:01/7:01    2
6:  rs_2    2   11:01/25:01    1   15:01/44:02    2    06:02/4:01    1    10:01/3:01    5

Edit

The OP has asked for an explanation of the last of the chained data.table expressions setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))).

This expressions orders the columns of the result by reference, i.e., without copying. When reshaping multiple value.vars the columns are grouped by value.var:

melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"), 
     value.name = cols)[
       , variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
         , .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
           , dcast(.SD, rs_id + Code ~ variable, value.var = cols)]

   rs_id Code Combination_A Combination_B Combination_C Combination_D Ag_A Ag_B Ag_C Ag_D
1:  rs_1    0    1:01/11:01   13:02/49:01   03:04/03:04     1:01/1:01    2    2    6    1
2:  rs_1    1     1:01/2:01   13:02/57:01    03:04/7:01     1:01/3:01    6    1    2    1
3:  rs_1    2    1:01/24:02    13:02/8:01   06:02/06:02     1:01/4:04    3    1    3    3
4:  rs_2    0     1:01/3:01    14:01/7:02   06:02/2:02:     1:01/4:07    1    1    1    1
5:  rs_2    1    11:01/2:01   15:01/15:01    06:02/3:03     1:01/7:01    4    1    1    2
6:  rs_2    2   11:01/25:01   15:01/44:02    06:02/4:01    10:01/3:01    1    2    1    5

while the OP expects the output to be grouped by variable. So the desired column order is

c(1, 2, 3, 7, 4, 8, 5, 9, 6, 10).

1 and 2 denote the id.var columns. as.vector(outer(c(0, 4), 3:6, "+"))) is just a way to save typing 3, 7, 4, 8, 5, 9, 6, 10.

outer(c(0, 4), 3:6, "+")

     [,1] [,2] [,3] [,4]
[1,]    3    4    5    6
[2,]    7    8    9   10

as.vector(outer(c(0, 4), 3:6, "+"))

[1]  3  7  4  8  5  9  6 10

Edit 2

The code can be further streamlined. The call to as.vector() is not necessary inside c() as c() turns arrays into vectors. So, instead of

c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))

we can write

c(1:2, outer(c(0, 4), 3:6, "+"))

Data

Note that I have completed the missing column headers for the last two columns.

library(data.table)
DF <- fread(
  "rs_id   Code    Combination_Ag  A.Ag    Combination_Bg  B.Ag    Combination_Cg  C.Ag   Combination_Dg  D.Ag
rs_1    0   1:01/1:01   1   13:02/13:02 1   03:04/03:04 6   1:01/1:01   1
rs_1    0   1:01/11:01  2   13:02/49:01 2   03:04/15:02 1   1:01/15:01  1
rs_1    1   1:01/2:01   6   13:02/57:01 1   03:04/7:01  2   1:01/3:01   1
rs_1    2   1:01/2:05:  1   13:02/8:01  1   06:02/06:02 3   1:01/4:04   1
rs_1    2   1:01/24:02  3   14:01/14:02 1   06:02/15:02 1   1:01/4:04   3
rs_2    0   1:01/3:01   1   14:01/7:02  1   06:02/2:02: 1   1:01/4:07   1
rs_2    1   1:01/31:01  1   15:01/15:01 1   06:02/3:03  1   1:01/7:01   2
rs_2    1   11:01/2:01  4   15:01/18:01 1   06:02/3:04  1   10:01/14:01 1
rs_2    2   11:01/25:01 1   15:01/44:02 2   06:02/4:01  1   10:01/3:01  5"
)

Finding the Highest Frequencies in multiple columns

Answers (1)

Edit

Edit 2

Data

Related Questions