Reputation: 5684
I have a data.frame with 50,000 rows, with some duplicates, which I would like to remove.
SYMBOL INTENSITY CALL VALUE
1 DDR1 2464.3023 P 0.00016023613
2 RFC2 496.5190 P 0.0193034606
3 HSPA6 733.4763 P 0.0008046637
4 PAX8 2138.2882 P 0.0005617505
5 GUCA1A 176.3272 A 0.1896873022
6 UBA7 729.6157 P 0.0170004527
7 DDR1 2464.3023 P 0.0016023613
8 RFC2 496.5190 P 0.0593034606
9 HSPA9 733.4763 P 0.0008046637
10 PAX8 2138.2882 P 0.15617505
11 GUCA1A2 176.3272 A 0.01896873022
12 UBA8 729.6157 P 0.0170004527
I used following to remove the duplicates. I kept the one with minimum "VALUE" in the fourth column.
dt <- data.table(df)
WoDuplic <- dt[,.SD[which.min(VALUE)],by=list(SYMBOL)]
It serves the purpose, but is very slow, it takes approximatively 10 seconds to remove the duplicates from a data.frame of above dimension. Is there any way to make the process faster?
Edited: Output looks like
SYMBOL INTENSITY CALL VALUE
1 DDR1 2464.3023 P 0.00016023613
2 RFC2 496.5190 P 0.0193034606
3 HSPA6 733.4763 P 0.0008046637
4 PAX8 2138.2882 P 0.0005617505
5 GUCA1A 176.3272 A 0.1896873022
6 UBA7 729.6157 P 0.0170004527
9 HSPA9 733.4763 P 0.0008046637
11 GUCA1A2 176.3272 A 0.01896873022
12 UBA8 729.6157 P 0.0170004527
Upvotes: 3
Views: 777
Reputation: 886938
We could get the index of rows (.I[which.min(..)
) that have minimum 'VALUE' for each 'SYMBOL' and use that column ('V1') to subset the dataset.
library(data.table)
dt[dt[,.I[which.min(VALUE)],by=list(SYMBOL)]$V1]
Or as @DavidArenburg mentioned, using setkey
would be more efficient (although I am not sure why you get error with the original data)
setkey(dt, VALUE)
indx <- dt[,.I[1L], by = SYMBOL]$V1
dt[indx]
Upvotes: 6
Reputation: 17648
You can use aggregate and merge to solve the problem. It should be very fast.
create an example data.frame
set.seed(123)
df <- data.frame(gene=rep(letters[1:20],2500),INTENSITY=1:50000,value=runif(50000))
get the min value for each gene
mins <- aggregate(value ~ gene, data = df, FUN = min)
and merge the additional columns
df.min <- merge(mins, df)
Upvotes: 0