Error404
Error404

Reputation: 7121

Partial duplicates in r

I am working with a big dataframe, that looks like this one

             X V1 V2 V3
1 IOJRE.no.m-1 10  2  0
2 IOJRE.no.m-2 10  7  0
3 OHIJR.no.m-4  5  5  1
4 OHIJR.no.m-6  5  6  1
5 OHIJR.no.m-3  3  3  1
6 OHIJR.no.m-1  2  3  1
7 ABCDE.no.m-1  5  6  0
8 ABCDE.no.m-5  1  4  0

> dput(mydata)
structure(list(X = structure(c(3L, 4L, 7L, 8L, 6L, 5L, 1L, 2L
), .Label = c("ABCDE.no.m-1", "ABCDE.no.m-5", "IOJRE.no.m-1", 
"IOJRE.no.m-2", "OHIJR.no.m-1", "OHIJR.no.m-3", "OHIJR.no.m-4", 
"OHIJR.no.m-6"), class = "factor"), V1 = c(10, 10, 5, 5, 3, 2, 
5, 1), V2 = c(2, 7, 5, 6, 3, 3, 6, 4), V3 = c(0, 0, 1, 1, 1, 
1, 0, 0)), .Names = c("X", "V1", "V2", "V3"), row.names = c(NA, 
8L), class = "data.frame")

The first 5 letters in many items in the first column are similar. I am trying to subset the data, keeping only one of these similar items, which has the highest V1 value. (if the V1 value is the same, it doesn't matter which one to keep for me).

I cannot think of a proper command because this is not a duplicated(). I thought of using aggregate() then which.max() But I really cannot construct a proper command for the job. Here's the output I am looking to have.

             X V1 V2 V3
1 IOJRE.no.m-1 10  2  0
2 OHIJR.no.m-4  5  5  1
3 ABCDE.no.m-1  5  6  0


> dput(mydata2)
structure(list(X = structure(c(2L, 3L, 1L), .Label = c("ABCDE.no.m-1", 
"IOJRE.no.m-1", "OHIJR.no.m-4"), class = "factor"), V1 = c(10L, 
5L, 5L), V2 = c(2L, 5L, 6L), V3 = c(0L, 1L, 0L)), .Names = c("X", 
"V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -3L
))

Can anyone help me with this?

Many thanks,

Upvotes: 2

Views: 679

Answers (2)

beginneR
beginneR

Reputation: 3291

Here's a data.table approach (probably not the best, but it works and is fast)

library("data.table")
mydata <- data.table(mydata)
mydata[,X:=strtrim(X,5)]
setkey(mydata,X,V1)
mydata[unique(X),mult="last"]

       X V1 V2 V3
1: ABCDE  5  6  0
2: IOJRE 10  7  0
3: OHIJR  5  6  1

Upvotes: 2

Asayat
Asayat

Reputation: 633

Probably not the fastest way to do that, but it answers your question :)

1) Substring your 1st variable:

substr(x$X,1,5)

2) Find max value for each group of substring

y <- lapply(split(x, substr(x$X,1,5)), function(x) {
x[which.max(x$V1),]})

3) Create new dataframe from results:

do.call(rbind, y)

Upvotes: 2

Related Questions