Reputation: 7121
I am working with a big dataframe, that looks like this one
X V1 V2 V3
1 IOJRE.no.m-1 10 2 0
2 IOJRE.no.m-2 10 7 0
3 OHIJR.no.m-4 5 5 1
4 OHIJR.no.m-6 5 6 1
5 OHIJR.no.m-3 3 3 1
6 OHIJR.no.m-1 2 3 1
7 ABCDE.no.m-1 5 6 0
8 ABCDE.no.m-5 1 4 0
> dput(mydata)
structure(list(X = structure(c(3L, 4L, 7L, 8L, 6L, 5L, 1L, 2L
), .Label = c("ABCDE.no.m-1", "ABCDE.no.m-5", "IOJRE.no.m-1",
"IOJRE.no.m-2", "OHIJR.no.m-1", "OHIJR.no.m-3", "OHIJR.no.m-4",
"OHIJR.no.m-6"), class = "factor"), V1 = c(10, 10, 5, 5, 3, 2,
5, 1), V2 = c(2, 7, 5, 6, 3, 3, 6, 4), V3 = c(0, 0, 1, 1, 1,
1, 0, 0)), .Names = c("X", "V1", "V2", "V3"), row.names = c(NA,
8L), class = "data.frame")
The first 5 letters in many items in the first column are similar. I am trying to subset the data, keeping only one of these similar items, which has the highest V1 value. (if the V1 value is the same, it doesn't matter which one to keep for me).
I cannot think of a proper command because this is not a duplicated()
. I thought of using aggregate()
then which.max()
But I really cannot construct a proper command for the job.
Here's the output I am looking to have.
X V1 V2 V3
1 IOJRE.no.m-1 10 2 0
2 OHIJR.no.m-4 5 5 1
3 ABCDE.no.m-1 5 6 0
> dput(mydata2)
structure(list(X = structure(c(2L, 3L, 1L), .Label = c("ABCDE.no.m-1",
"IOJRE.no.m-1", "OHIJR.no.m-4"), class = "factor"), V1 = c(10L,
5L, 5L), V2 = c(2L, 5L, 6L), V3 = c(0L, 1L, 0L)), .Names = c("X",
"V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -3L
))
Can anyone help me with this?
Many thanks,
Upvotes: 2
Views: 679
Reputation: 3291
Here's a data.table approach (probably not the best, but it works and is fast)
library("data.table")
mydata <- data.table(mydata)
mydata[,X:=strtrim(X,5)]
setkey(mydata,X,V1)
mydata[unique(X),mult="last"]
X V1 V2 V3
1: ABCDE 5 6 0
2: IOJRE 10 7 0
3: OHIJR 5 6 1
Upvotes: 2
Reputation: 633
Probably not the fastest way to do that, but it answers your question :)
1) Substring your 1st variable:
substr(x$X,1,5)
2) Find max value for each group of substring
y <- lapply(split(x, substr(x$X,1,5)), function(x) {
x[which.max(x$V1),]})
3) Create new dataframe from results:
do.call(rbind, y)
Upvotes: 2