chupvl

Reputation: 1298

R: removing duplicates while keeping the max value

For every group of rows that share the same var, I need to take the max value of every column. I tried the plyr approach below, but it is very slow. Any suggestions for optimization or alternatives?

x <- ddply(x, "var", numcolwise(max))

A data example is below; my real data has around 10K rows and 5K columns.

var L1  L2  L3  L4  L5  L6
var1    0.509100101 0.015739581 0.957789164 0.902292389 0.826366893 0.182984811
var1    0.879927428 0.652231012 0.937680366 0.275712565 0.697747469 0.839823493
var1    0.708760238 0.650970415 0.342625412 0.945388553 0.780597336 0.125712247
var4    0.169279846 0.712183393 0.805263513 0.119002298 0.774175311 0.656468809
var5    0.343674627 0.808923336 0.339527067 0.73768585  0.166461354 0.588347206
var5    0.690868073 0.195300702 0.16614307  0.112241319 0.747591381 0.9433346
var5    0.209711182 0.764779757 0.283426455 0.373367398 0.225350503 0.492510721
var8    0.805516572 0.842888657 0.275147394 0.960842561 0.734521004 0.605078001
var9    0.534889726 0.444212963 0.165563287 0.590991951 0.244009571 0.188597899
var10   0.499110864 0.496455884 0.993483606 0.444440833 0.755452203 0.895849025
var11   0.366111435 0.580177811 0.411523103 0.111694655 0.892282071 0.777860302
var12   0.200600203 0.809859039 0.890912839 0.063123472 0.107956515 0.441015454
var13   0.46876789  0.529304262 0.410794171 0.072144288 0.497160109 0.291629322
var14   0.546108044 0.436783087 0.315849023 0.610337503 0.552454722 0.50320375
var14   0.206252562 0.919754189 0.060999717 0.808003665 0.441351656 0.190068706

Upvotes: 2

Views: 1536

Answers (2)

krlmlr

Reputation: 25444

Melt the data into "long" format, order by decreasing value and take the first instance of each combination of var and variable:

library(magrittr)
library(plyr)
library(reshape2)
data %>%
  melt(id.vars = "var") %>%
  arrange(-value) %>%
  # keep the first (largest) row for each var/variable pair
  subset(!duplicated(interaction(var, variable)))

If necessary, convert the data back into "wide" format:

... %>%
dcast(var~variable)
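For a quick check of the idea, here is a minimal sketch on a toy data frame (values rounded from the question, only two columns; the names `long` and `wide` are just illustrative). It uses base R for the ordering and de-duplication steps rather than the pipe, so treat it as one possible spelling of the same approach, not the only one:

```r
library(reshape2)

# Toy data: three rows for var1, one row for var4 (rounded from the question)
data <- data.frame(
  var = c("var1", "var1", "var1", "var4"),
  L1  = c(0.509, 0.880, 0.709, 0.169),
  L2  = c(0.016, 0.652, 0.651, 0.712)
)

# Wide -> long: one row per (var, variable) observation
long <- melt(data, id.vars = "var")

# Largest values first, then keep the first row of each var/variable pair
long <- long[order(-long$value), ]
long <- long[!duplicated(long[c("var", "variable")]), ]

# Long -> wide again: one row per var, one column per original column
wide <- dcast(long, var ~ variable, value.var = "value")
print(wide)
```

Note that with roughly 10K rows and 5K columns the long table has on the order of 50 million rows, so sorting it may well be slower than a grouped max.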

Upvotes: 0

akrun

Reputation: 887088

You could try

library(dplyr)
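# summarise_each() and funs() are deprecated in current dplyr;
# there the equivalent is summarise(across(everything(), max))
x %>%
    group_by(var) %>%
    summarise_each(funs(max))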

Or

library(data.table)
setDT(x)[, lapply(.SD, max) , var]
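As a small self-contained sketch of the data.table version (toy values rounded from the question, two columns only; `dt` and `res` are illustrative names), assuming every non-key column is numeric:

```r
library(data.table)

# Toy data: three rows for var5, one row for var8 (rounded from the question)
x <- data.frame(
  var = c("var5", "var5", "var5", "var8"),
  L1  = c(0.344, 0.691, 0.210, 0.806),
  L2  = c(0.809, 0.195, 0.765, 0.843)
)

# as.data.table() makes a copy; setDT(x) would convert x in place instead
dt <- as.data.table(x)

# .SD holds all non-grouping columns of the current group,
# so lapply(.SD, max) computes one max per column per var
res <- dt[, lapply(.SD, max), by = var]
print(res)
```

Groups come back in order of first appearance. If some columns are not numeric, `.SDcols` can restrict which ones `max` is applied to.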

Upvotes: 4
