Agaz Wani
Agaz Wani

Reputation: 5684

Removing duplicates from a data frame, very fast

I have a data.frame with 50,000 rows, with some duplicates, which I would like to remove.

    SYMBOL          INTENSITY CALL          VALUE
1     DDR1          2464.3023    P  0.00016023613
2     RFC2           496.5190    P   0.0193034606
3    HSPA6           733.4763    P   0.0008046637
4     PAX8          2138.2882    P   0.0005617505
5   GUCA1A           176.3272    A   0.1896873022
6     UBA7           729.6157    P   0.0170004527
7     DDR1          2464.3023    P   0.0016023613
8     RFC2           496.5190    P   0.0593034606
9    HSPA9           733.4763    P   0.0008046637
10    PAX8          2138.2882    P     0.15617505
11 GUCA1A2           176.3272    A  0.01896873022
12    UBA8           729.6157    P   0.0170004527

I used following to remove the duplicates. I kept the one with minimum "VALUE" in the fourth column.

dt <- data.table(df)   
WoDuplic <- dt[,.SD[which.min(VALUE)],by=list(SYMBOL)]

It serves the purpose, but is very slow, it takes approximatively 10 seconds to remove the duplicates from a data.frame of above dimension. Is there any way to make the process faster?

Edited: Output looks like

SYMBOL          INTENSITY CALL          VALUE
1     DDR1          2464.3023    P  0.00016023613
2     RFC2           496.5190    P   0.0193034606
3    HSPA6           733.4763    P   0.0008046637
4     PAX8          2138.2882    P   0.0005617505
5   GUCA1A           176.3272    A   0.1896873022
6     UBA7           729.6157    P   0.0170004527


9    HSPA9           733.4763    P   0.0008046637

11 GUCA1A2           176.3272    A  0.01896873022
12    UBA8           729.6157    P   0.0170004527

Upvotes: 3

Views: 777

Answers (2)

akrun
akrun

Reputation: 886938

We could get the index of rows (.I[which.min(..)) that have minimum 'VALUE' for each 'SYMBOL' and use that column ('V1') to subset the dataset.

library(data.table)
dt[dt[,.I[which.min(VALUE)],by=list(SYMBOL)]$V1]

Or as @DavidArenburg mentioned, using setkey would be more efficient (although I am not sure why you get error with the original data)

 setkey(dt, VALUE) 
 indx <- dt[,.I[1L], by = SYMBOL]$V1 
 dt[indx]

Upvotes: 6

Roman
Roman

Reputation: 17648

You can use aggregate and merge to solve the problem. It should be very fast.

create an example data.frame

set.seed(123)
df <- data.frame(gene=rep(letters[1:20],2500),INTENSITY=1:50000,value=runif(50000))

get the min value for each gene

mins <- aggregate(value ~ gene, data = df, FUN = min)

and merge the additional columns

df.min <- merge(mins, df)

Upvotes: 0

Related Questions