Davy Kavanagh
Davy Kavanagh

Reputation: 4939

Subset data.table using min condition

There is probably a really simple solution to this problem, but I couldn't find it from googling, or the data.table FAQ.

I have a data.table like so:

> test
    chr     bp          ID   REF       ALT        AF  AC   AN                EFFECT   IMPACT FUNCLASS CODING       GENE      pos effRank
 1:   1 860416  rs61464428     G         A 0.5000000  14   28              UPSTREAM MODIFIER          CODING     SAMD11 1:860416      21
 2:   1 860416  rs61464428     G         A 0.5000000  14   28              UPSTREAM MODIFIER          CODING     SAMD11 1:860416      21
 3:   1 860416  rs61464428     G         A 0.5000000  14   28            DOWNSTREAM MODIFIER          CODING AL645608.1 1:860416      22
 4:   1 860461  rs57465118     G         A 1.0000000  62   62              UPSTREAM MODIFIER          CODING     SAMD11 1:860461      21
 5:   1 860461  rs57465118     G         A 1.0000000  62   62              UPSTREAM MODIFIER          CODING     SAMD11 1:860461      21
 6:   1 860461  rs57465118     G         A 1.0000000  62   62            DOWNSTREAM MODIFIER          CODING AL645608.1 1:860461      22
 7:   1 860521  rs57924093     C         A 0.9840000  61   62              UPSTREAM MODIFIER          CODING     SAMD11 1:860521      21
 8:   1 860521  rs57924093     C         A 0.9840000  61   62              UPSTREAM MODIFIER          CODING     SAMD11 1:860521      21
 9:   1 860521  rs57924093     C         A 0.9840000  61   62            DOWNSTREAM MODIFIER          CODING AL645608.1 1:860521      22
10:   1 861261 rs144896029     G         A 0.0027270   3 1100              UPSTREAM MODIFIER          CODING     SAMD11 1:861261      21
11:   1 861261 rs144896029     G         A 0.0027270   3 1100            DOWNSTREAM MODIFIER          CODING AL645608.1 1:861261      22
12:   1 861332                 G         A 0.0009074   1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING AL645608.1 1:861332      11
13:   1 861332                 G         A 0.0009074   1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING     SAMD11 1:861332      11
14:   1 861332                 G         A 0.0009074   1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING     SAMD11 1:861332      11
15:   1 861332                 G         A 0.0009074   1 1102 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING     SAMD11 1:861332      11
16:   1 861332                 G         A 0.0009074   1 1102              UPSTREAM MODIFIER          CODING     SAMD11 1:861332      21
17:   1 865455                 C         G 0.0033190   3  904              UPSTREAM MODIFIER          CODING     SAMD11 1:865455      21
18:   1 865628  rs41285790     G         A 0.0027780   3 1080 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING     SAMD11 1:865628      11
19:   1 865628  rs41285790     G         A 0.0027780   3 1080 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING     SAMD11 1:865628      11
20:   1 865628  rs41285790     G         A 0.0027780   3 1080 NON_SYNONYMOUS_CODING MODERATE MISSENSE CODING     SAMD11 1:865628      11
21:   1 865628  rs41285790     G         A 0.0027780   3 1080     SYNONYMOUS_CODING      LOW   SILENT CODING AL645608.1 1:865628      14
22:   1 865628  rs41285790     G         A 0.0027780   3 1080              UPSTREAM MODIFIER          CODING     SAMD11 1:865628      21
23:   1 866437 rs139076934     C         T 0.0009074   1 1102     SYNONYMOUS_CODING      LOW   SILENT CODING AL645608.1 1:866437      14
24:   1 866437 rs139076934     C         T 0.0009074   1 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:866437      14
25:   1 866437 rs139076934     C         T 0.0009074   1 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:866437      14
26:   1 866437 rs139076934     C         T 0.0009074   1 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:866437      14
27:   1 866461 rs148884928     G         A 0.0009074   1 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:866461      14
28:   1 866461 rs148884928     G         A 0.0009074   1 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:866461      14
29:   1 866461 rs148884928     G         A 0.0009074   1 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:866461      14
30:   1 866461 rs148884928     G         A 0.0009074   1 1102              UPSTREAM MODIFIER          CODING AL645608.1 1:866461      21
31:   1 866511  rs71576583 CCCCT CCCCTCCCT 1.0000000 148  148              UPSTREAM MODIFIER          CODING AL645608.1 1:866511      21
32:   1 871057                 C         T 0.0009074   1 1102              UPSTREAM MODIFIER          CODING     SAMD11 1:871057      21
33:   1 871057                 C         T 0.0009074   1 1102              UPSTREAM MODIFIER          CODING AL645608.1 1:871057      21
34:   1 871057                 C         T 0.0009074   1 1102              UPSTREAM MODIFIER          CODING     SAMD11 1:871057      21
35:   1 871215  rs28419423     C         G 0.0036300   4 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:871215      14
36:   1 871215  rs28419423     C         G 0.0036300   4 1102     SYNONYMOUS_CODING      LOW   SILENT CODING     SAMD11 1:871215      14
37:   1 871215  rs28419423     C         G 0.0036300   4 1102              UPSTREAM MODIFIER          CODING     SAMD11 1:871215      21
38:   1 871215  rs28419423     C         G 0.0036300   4 1102              UPSTREAM MODIFIER          CODING     SAMD11 1:871215      21
39:   1 871215  rs28419423     C         G 0.0036300   4 1102              UPSTREAM MODIFIER          CODING AL645608.1 1:871215      21
40:   1 871215  rs28419423     C         G 0.0036300   4 1102            DOWNSTREAM MODIFIER          CODING     SAMD11 1:871215      22
41:   1 871287                 C         G 0.0009107   1 1098              UPSTREAM MODIFIER          CODING     SAMD11 1:871287      21
42:   1 871287                 C         G 0.0009107   1 1098              UPSTREAM MODIFIER          CODING     SAMD11 1:871287      21
43:   1 871287                 C         G 0.0009107   1 1098              UPSTREAM MODIFIER          CODING AL645608.1 1:871287      21
44:   1 871287                 C         G 0.0009107   1 1098            DOWNSTREAM MODIFIER          CODING     SAMD11 1:871287      22
45:   1 871334   rs4072383     G         T 0.6680000 474  710              UPSTREAM MODIFIER          CODING     SAMD11 1:871334      21
46:   1 871334   rs4072383     G         T 0.6680000 474  710              UPSTREAM MODIFIER          CODING     SAMD11 1:871334      21
47:   1 871334   rs4072383     G         T 0.6680000 474  710              UPSTREAM MODIFIER          CODING AL645608.1 1:871334      21
48:   1 871334   rs4072383     G         T 0.6680000 474  710            DOWNSTREAM MODIFIER          CODING     SAMD11 1:871334      22
49:   1 874415  rs74047412     C         T 0.0018250   2 1096              UPSTREAM MODIFIER          CODING     SAMD11 1:874415      21
50:   1 874415  rs74047412     C         T 0.0018250   2 1096              UPSTREAM MODIFIER          CODING     SAMD11 1:874415      21
    chr     bp          ID   REF       ALT        AF  AC   AN                EFFECT   IMPACT FUNCLASS CODING       GENE      pos effRank

As you can see, the values in the many of the rows are repeats, for some of the columns. What I want to do is remove the duplicated rows, based on the value (the min) of the effRank variable. I have set the key to be chr, bp, and effRank. So the table should be sorted on the basis of those three columns. I got kind of close. The following command returns the rows that I want, but does not return all columns, which I want.

> test[,min(effRank), by=pos]
         pos V1
 1: 1:860416 21
 2: 1:860461 21
 3: 1:860521 21
 4: 1:861261 21
 5: 1:861332 11
 6: 1:865455 21
 7: 1:865628 11
 8: 1:866437 14
 9: 1:866461 14
10: 1:866511 21
11: 1:871057 21
12: 1:871215 14
13: 1:871287 21
14: 1:871334 21
15: 1:874415 21

All I need is a way to make the above command return all columns in the data.table, not just the ones mentioned in the expressions. Otherwise, works perfectly. Any help is appreciated. The output of dput is below, for those that with to make their own example.

Cheers, Davy

> dput(test)
structure(list(chr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), bp = c(860416L, 860416L, 860416L, 
860461L, 860461L, 860461L, 860521L, 860521L, 860521L, 861261L, 
861261L, 861332L, 861332L, 861332L, 861332L, 861332L, 865455L, 
865628L, 865628L, 865628L, 865628L, 865628L, 866437L, 866437L, 
866437L, 866437L, 866461L, 866461L, 866461L, 866461L, 866511L, 
871057L, 871057L, 871057L, 871215L, 871215L, 871215L, 871215L, 
871215L, 871215L, 871287L, 871287L, 871287L, 871287L, 871334L, 
871334L, 871334L, 871334L, 874415L, 874415L), ID = structure(c(10L, 
10L, 10L, 8L, 8L, 8L, 9L, 9L, 9L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 
1L, 7L, 7L, 7L, 7L, 7L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 11L, 
1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 
6L, 12L, 12L), .Label = c("", "rs139076934", "rs144896029", "rs148884928", 
"rs28419423", "rs4072383", "rs41285790", "rs57465118", "rs57924093", 
"rs61464428", "rs71576583", "rs74047412"), class = "factor"), 
    REF = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 
    1L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L), .Label = c("C", 
    "CCCCT", "G"), class = "factor"), ALT = structure(c(1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 
    1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 4L, 
    4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
    4L, 4L, 4L), .Label = c("A", "CCCCTCCCT", "G", "T"), class = "factor"), 
    AF = c(0.5, 0.5, 0.5, 1, 1, 1, 0.984, 0.984, 0.984, 0.002727, 
    0.002727, 0.0009074, 0.0009074, 0.0009074, 0.0009074, 0.0009074, 
    0.003319, 0.002778, 0.002778, 0.002778, 0.002778, 0.002778, 
    0.0009074, 0.0009074, 0.0009074, 0.0009074, 0.0009074, 0.0009074, 
    0.0009074, 0.0009074, 1, 0.0009074, 0.0009074, 0.0009074, 
    0.00363, 0.00363, 0.00363, 0.00363, 0.00363, 0.00363, 0.0009107, 
    0.0009107, 0.0009107, 0.0009107, 0.668, 0.668, 0.668, 0.668, 
    0.001825, 0.001825), AC = c(14L, 14L, 14L, 62L, 62L, 62L, 
    61L, 61L, 61L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 
    3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 148L, 1L, 1L, 1L, 
    4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 474L, 474L, 474L, 
    474L, 2L, 2L), AN = c(28L, 28L, 28L, 62L, 62L, 62L, 62L, 
    62L, 62L, 1100L, 1100L, 1102L, 1102L, 1102L, 1102L, 1102L, 
    904L, 1080L, 1080L, 1080L, 1080L, 1080L, 1102L, 1102L, 1102L, 
    1102L, 1102L, 1102L, 1102L, 1102L, 148L, 1102L, 1102L, 1102L, 
    1102L, 1102L, 1102L, 1102L, 1102L, 1102L, 1098L, 1098L, 1098L, 
    1098L, 710L, 710L, 710L, 710L, 1096L, 1096L), EFFECT = structure(c(4L, 
    4L, 1L, 4L, 4L, 1L, 4L, 4L, 1L, 4L, 1L, 2L, 2L, 2L, 2L, 4L, 
    4L, 2L, 2L, 2L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 
    4L, 4L, 4L, 3L, 3L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 1L, 4L, 4L, 
    4L, 1L, 4L, 4L), .Label = c("DOWNSTREAM", "NON_SYNONYMOUS_CODING", 
    "SYNONYMOUS_CODING", "UPSTREAM"), class = "factor"), IMPACT = structure(c(3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 
    3L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 
    3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L), .Label = c("LOW", "MODERATE", "MODIFIER"
    ), class = "factor"), FUNCLASS = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 
    2L, 2L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 
    1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L), .Label = c("", "MISSENSE", "SILENT"), class = "factor"), 
    CODING = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CODING", class = "factor"), 
    GENE = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
    2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("AL645608.1", 
    "SAMD11"), class = "factor"), pos = structure(c(1L, 1L, 1L, 
    2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 7L, 
    7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 11L, 
    11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 
    14L, 14L, 14L, 14L, 15L, 15L), .Label = c("1:860416", "1:860461", 
    "1:860521", "1:861261", "1:861332", "1:865455", "1:865628", 
    "1:866437", "1:866461", "1:866511", "1:871057", "1:871215", 
    "1:871287", "1:871334", "1:874415"), class = "factor"), effRank = c(21L, 
    21L, 22L, 21L, 21L, 22L, 21L, 21L, 22L, 21L, 22L, 11L, 11L, 
    11L, 11L, 21L, 21L, 11L, 11L, 11L, 14L, 21L, 14L, 14L, 14L, 
    14L, 14L, 14L, 14L, 21L, 21L, 21L, 21L, 21L, 14L, 14L, 21L, 
    21L, 21L, 22L, 21L, 21L, 21L, 22L, 21L, 21L, 21L, 22L, 21L, 
    21L)), .Names = c("chr", "bp", "ID", "REF", "ALT", "AF", 
"AC", "AN", "EFFECT", "IMPACT", "FUNCLASS", "CODING", "GENE", 
"pos", "effRank"), row.names = c(NA, -50L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x0000000004260788>, sorted = c("chr", 
"bp", "effRank"))

Upvotes: 5

Views: 418

Answers (1)

Arun
Arun

Reputation: 118779

You can use the internal variable .I, which gives the row number. Then subset using those values, as follows:

DT[DT[, .I[which.min(effRank)], pos]$V1]

It's easier to understand if you write it in two lines as follows:

tmp <- DT[, .I[which.min(effRank)], pos]
DT[tmp$V1]

The first line generates a column V1 with all the row numbers of the minimum positions (from your j expression) grouped by pos. Then you just subset them.

Upvotes: 8

Related Questions