Madcap
Madcap

Reputation: 77

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the least amount of columns filled. If an equal number are filled, either can be removed. For example,

Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7), 
v5=c(1,NA,5,5,NA,5,NA,7))

The output would be the following:

Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))

My real dataset is bigger and a mix of mostly numerical, but some character variables, but I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within the duplicates command called check.all.

So far, my thoughts have been to use grepl and determine where "anything" is present

Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))

Then, using the resultant dataframe I ask for rowSums and Cbind it to the original.

CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)

This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure of how to implement duplicates.

Simply, I'm looking for When id, key, and num are duplicated - keep the row with the highest value of CompleteNess.

If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!

Upvotes: 4

Views: 515

Answers (2)

Dave2e
Dave2e

Reputation: 24079

Here is a solution. It is not very pretty but it should work for your application:

#Order by the degree of completeness    
Original<-Original[order(CompleteNess),]

#Starting from the bottom select the not duplicated rows 
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]

This does rearrange your original data frame so beware if there is additional processing later on.

Upvotes: 3

USER_1
USER_1

Reputation: 2469

You can aggregate your data and select the row with max score:

Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
                       key=c(1,2,2,3,3,4,5,5),
                       num=c(1,1,1,1,1,1,1,1),
                       v4= c(1,NA,5,5,NA,5,NA,7), 
                       v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))

#get the score 
Original$present <- rowSums(Present)

#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")

library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
      Max = max(present))

And if you want to keep the other columns, just do this:

Final <- ddply(Original,.(id.key.num),summarize,
      Max = max(present),
      v4 = v4[which.max(present)],
      v5 = v5[which.max(present)]
      )

Upvotes: 2

Related Questions