Remove Duplicates, but Keep the Most Complete Iteration

Question

I'm trying to figure out how remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the least amount of columns filled. If an equal number are filled, either can be removed. For example,

Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7), 
v5=c(1,NA,5,5,NA,5,NA,7))

The output would be the following:

Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))

My real dataset is bigger and a mix of mostly numerical, but some character variables, but I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within the duplicates command called check.all.

So far, my thoughts have been to use grepl and determine where "anything" is present

Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))

Then, using the resultant dataframe I ask for rowSums and Cbind it to the original.

CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)

This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure of how to implement duplicates.

Simply, I'm looking for When id, key, and num are duplicated - keep the row with the highest value of CompleteNess.

If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!

Dave2e · Accepted Answer

Here is a solution. It is not very pretty but it should work for your application:

#Order by the degree of completeness    
Original<-Original[order(CompleteNess),]

#Starting from the bottom select the not duplicated rows 
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]

This does rearrange your original data frame so beware if there is additional processing later on.

Remove Duplicates, but Keep the Most Complete Iteration

Answers (2)

Related Questions