Reputation: 85
I have a data_frame containing 10 columns and 2000 rows. My sample data would look like:
rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
I am trying to find the highest combination (A.Ag, B.Bg C.Ag) for rs_id =0, 1 and 2. How can I achieve this? The output would be
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/03:04 6 1:01/1:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/24:02 3 06:02/06:02 3 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
Upvotes: 1
Views: 80
Reputation: 42544
This approach reshapes the data from wide to long format (melting two measure columns simultaneously), picks the row with the top Ag
value for each unique combination of rs_id
, Code
, and variable
. Finally, the result is reshaped from long to wide format again with the column order rearranged to return the expected result:
library(data.table)
cols <- c("Combination", "Ag")
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
value.name = cols)[
, variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
, .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
, dcast(.SD, rs_id + Code ~ variable, value.var = cols)][
, setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+"))))]
rs_id Code Combination_A Ag_A Combination_B Ag_B Combination_C Ag_C Combination_D Ag_D 1: rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/03:04 6 1:01/1:01 1 2: rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1 3: rs_1 2 1:01/24:02 3 13:02/8:01 1 06:02/06:02 3 1:01/4:04 3 4: rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1 5: rs_2 1 11:01/2:01 4 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2 6: rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
The OP has asked for an explanation of the last of the chained data.table
expressions setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+"))))
.
This expressions orders the columns of the result by reference, i.e., without copying. When reshaping multiple value.var
s the columns are grouped by value.var
:
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
value.name = cols)[
, variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
, .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
, dcast(.SD, rs_id + Code ~ variable, value.var = cols)]
rs_id Code Combination_A Combination_B Combination_C Combination_D Ag_A Ag_B Ag_C Ag_D 1: rs_1 0 1:01/11:01 13:02/49:01 03:04/03:04 1:01/1:01 2 2 6 1 2: rs_1 1 1:01/2:01 13:02/57:01 03:04/7:01 1:01/3:01 6 1 2 1 3: rs_1 2 1:01/24:02 13:02/8:01 06:02/06:02 1:01/4:04 3 1 3 3 4: rs_2 0 1:01/3:01 14:01/7:02 06:02/2:02: 1:01/4:07 1 1 1 1 5: rs_2 1 11:01/2:01 15:01/15:01 06:02/3:03 1:01/7:01 4 1 1 2 6: rs_2 2 11:01/25:01 15:01/44:02 06:02/4:01 10:01/3:01 1 2 1 5
while the OP expects the output to be grouped by variable
. So the desired column order is
c(1, 2, 3, 7, 4, 8, 5, 9, 6, 10)
.
1
and 2
denote the id.var
columns. as.vector(outer(c(0, 4), 3:6, "+")))
is just a way to save typing 3, 7, 4, 8, 5, 9, 6, 10
.
outer(c(0, 4), 3:6, "+")
[,1] [,2] [,3] [,4] [1,] 3 4 5 6 [2,] 7 8 9 10
as.vector(outer(c(0, 4), 3:6, "+"))
[1] 3 7 4 8 5 9 6 10
The code can be further streamlined. The call to as.vector()
is not necessary inside c()
as c()
turns arrays into vectors. So, instead of
c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))
we can write
c(1:2, outer(c(0, 4), 3:6, "+"))
Note that I have completed the missing column headers for the last two columns.
library(data.table)
DF <- fread(
"rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag Combination_Dg D.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5"
)
Upvotes: 3