Seth W. Bigelow
Seth W. Bigelow

Reputation: 61

How do I eliminate duplicates in data.table using mult

The solution to this simple problem has eluded me for several hours. I have a data table in which a value is identified by several classification variables (A, B, L). Where there are observations characterized by duplicate classification variables A & B, I want to retain the one that has the highest 'L'. So, if I have a table generated with this code

set.seed(17)
DT <- data.table(A=rep(c("a","b"),each=5), 
    B=c("a","b","c","d","d","a","b","b","c","d"), 
    L=c(1,1,1,2,1,1,1,2,1,1),
    val=rnbinom(10, size=2, mu=3))

Making the following:

A B L val
 1: a a 1   1
 2: a b 1  10
 3: a c 1   3
 4: a d 1   5
 5: a d 2   2
 6: b a 1   8
 7: b b 1   7
 8: b b 2   1
 9: b c 1   2
10: b d 1   2

I have tried commands such as

setkey(DT,A,B,L)
DT[ , .(A,B,L,val)  , mult="last"] 

but I'm just not getting something. I want a resulting table that looks like this

 A B L val
 1: a a 1   1
 2: a b 1  10
 3: a c 1   3
 5: a d 2   2
 6: b a 1   8
 8: b b 2   1
 9: b c 1   2
10: b d 1   2

Upvotes: 1

Views: 70

Answers (4)

user12728748
user12728748

Reputation: 8516

DT[, lapply(.SD, last), .(A,B)])

should also work and seems to be a bit faster than the merge solution

Upvotes: 1

pseudospin
pseudospin

Reputation: 2777

Here's how I'd do it (without mult)

DT[order(-L), .SD[1], .(A,B)]

With mult something like this would do it - note that Im doing an actual join here

DT[order(L)][unique(DT[, .(A, B)]), on = c('A', 'B'), mult = 'last']
#>    A B L val
#> 1: a a 1   1
#> 2: a b 1   1
#> 3: a c 1   3
#> 4: a d 2  12
#> 5: b a 1   6
#> 6: b b 2   2
#> 7: b c 1   3
#> 8: b d 1   5

Upvotes: 0

Will Hipson
Will Hipson

Reputation: 374

set.seed(17)
library(data.table)
DT <- data.table(A=rep(c("a","b"),each=5), 
                 B=c("a","b","c","d","d","a","b","b","c","d"), 
                 L=c(1,1,1,2,1,1,1,2,1,1),
                 val=rnbinom(10, size=2, mu=3))

result <- DT[DT[, .I[L == max(L)], by = list(A, B)]$V1]

> result
   A B L val
1: a a 1   1
2: a b 1   1
3: a c 1   3
4: a d 2  12
5: b a 1   6
6: b b 2   2
7: b c 1   3
8: b d 1   5

Upvotes: 0

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

solution option

library(data.table)

dt <- structure(list(A = c("a", "a", "a", "a", "a", "b", "b", "b", 
                           "b", "b"), B = c("a", "b", "c", "d", "d", "a", "b", "b", "c", 
                                            "d"), L = c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L), val = c(1L, 
                                                                                                         10L, 3L, 5L, 2L, 8L, 7L, 1L, 2L, 2L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                    -10L))

setDT(dt)                                                                                                                 
merge(dt[, list(L = last(L)), by =list(A, B)], dt)
#>    A B L val
#> 1: a a 1   1
#> 2: a b 1  10
#> 3: a c 1   3
#> 4: a d 2   2
#> 5: b a 1   8
#> 6: b b 2   1
#> 7: b c 1   2
#> 8: b d 1   2

Created on 2021-03-24 by the reprex package (v1.0.0)

Upvotes: 0

Related Questions