Partial duplicates in r

Question

I am working with a big dataframe, that looks like this one

             X V1 V2 V3
1 IOJRE.no.m-1 10  2  0
2 IOJRE.no.m-2 10  7  0
3 OHIJR.no.m-4  5  5  1
4 OHIJR.no.m-6  5  6  1
5 OHIJR.no.m-3  3  3  1
6 OHIJR.no.m-1  2  3  1
7 ABCDE.no.m-1  5  6  0
8 ABCDE.no.m-5  1  4  0

> dput(mydata)
structure(list(X = structure(c(3L, 4L, 7L, 8L, 6L, 5L, 1L, 2L
), .Label = c("ABCDE.no.m-1", "ABCDE.no.m-5", "IOJRE.no.m-1", 
"IOJRE.no.m-2", "OHIJR.no.m-1", "OHIJR.no.m-3", "OHIJR.no.m-4", 
"OHIJR.no.m-6"), class = "factor"), V1 = c(10, 10, 5, 5, 3, 2, 
5, 1), V2 = c(2, 7, 5, 6, 3, 3, 6, 4), V3 = c(0, 0, 1, 1, 1, 
1, 0, 0)), .Names = c("X", "V1", "V2", "V3"), row.names = c(NA, 
8L), class = "data.frame")

The first 5 letters in many items in the first column are similar. I am trying to subset the data, keeping only one of these similar items, which has the highest V1 value. (if the V1 value is the same, it doesn't matter which one to keep for me).

I cannot think of a proper command because this is not a duplicated(). I thought of using aggregate() then which.max() But I really cannot construct a proper command for the job. Here's the output I am looking to have.

             X V1 V2 V3
1 IOJRE.no.m-1 10  2  0
2 OHIJR.no.m-4  5  5  1
3 ABCDE.no.m-1  5  6  0


> dput(mydata2)
structure(list(X = structure(c(2L, 3L, 1L), .Label = c("ABCDE.no.m-1", 
"IOJRE.no.m-1", "OHIJR.no.m-4"), class = "factor"), V1 = c(10L, 
5L, 5L), V2 = c(2L, 5L, 6L), V3 = c(0L, 1L, 0L)), .Names = c("X", 
"V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -3L
))

Can anyone help me with this?

Many thanks,

beginneR · Accepted Answer

Here's a data.table approach (probably not the best, but it works and is fast)

library("data.table")
mydata <- data.table(mydata)
mydata[,X:=strtrim(X,5)]
setkey(mydata,X,V1)
mydata[unique(X),mult="last"]

       X V1 V2 V3
1: ABCDE  5  6  0
2: IOJRE 10  7  0
3: OHIJR  5  6  1

Partial duplicates in r

Answers (2)

Related Questions