Reputation: 97
Many previous questions show various ways to remove duplicate rows with missing values; however, none deal with the following case. Example starting data:
df <- data.frame(x = c(1, NA, 1), y=c(NA, 1, 1), z=c(0, NA, NA))
print(df)
Desired output:
df2 <- data.frame(x = c(1, 1), y=c(NA, 1), z=c(0, NA))
print(df2)
In this case the second row was removed because it was a perfect subset of row 3. In the real application I want to remove rows whose non-missing values are all repeated in another row, and keep the row that has fewer missing values overall.
I thought this might be accomplished using dplyr and a rowwise application of distinct(), but to no avail. I could do this with a very slow for loop, but with hundreds of columns and thousands of rows this is a poor option.
Upvotes: 3
Views: 208
Reputation: 25225
Here is another option using data.table:
library(data.table)
# convert to long format and discard NAs; cnt = number of non-NA values per row
mDT <- melt(setDT(df)[, rn := .I], id.vars = "rn", na.rm = TRUE)[, cnt := .N, by = rn]
# self join and keep only matches between different rows
merged <- mDT[mDT, on = .(variable, value), {
    diffrow <- i.rn != x.rn
    .(irn = i.rn[diffrow], xrn = x.rn[diffrow], icnt = i.cnt[diffrow])
}]
# count matches per row pair and keep rows whose non-NA values all match another row
# caveat: exact duplicate rows match each other, so both would be dropped; deduplicate first
ix <- merged[, xcnt := .N, .(irn, xrn)][icnt == xcnt]$irn
# delete dupe rows; filtering on rn also works when ix is empty
df[!rn %in% ix]
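To make the first step easier to follow, here is a small sketch of what the long format looks like for the example data from the question. It works on a copy (the name `dt` is just for illustration) so the original `df` is left alone, and it only reproduces the melt and count steps, not the whole answer.

```r
library(data.table)

# Example data from the question; as.data.table() copies, so df itself is untouched
df <- data.frame(x = c(1, NA, 1), y = c(NA, 1, 1), z = c(0, NA, NA))
dt <- as.data.table(df)[, rn := .I]

# One line per non-NA cell: rn (source row), variable (column name), value
mDT <- melt(dt, id.vars = "rn", na.rm = TRUE)

# cnt = number of non-NA cells in each source row
mDT[, cnt := .N, by = rn]
print(mDT)
```

Row 2 contributes a single line (y = 1), and that same variable/value pair also occurs in row 3, which is why the self join ends up flagging row 2 as fully matched.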
Upvotes: 1
Reputation: 5232
I'm not sure how to do it with dplyr, but here is a solution with a loop. I'm also not sure that a dplyr solution would be faster than a loop (in the end it must loop over the rows somehow), and here you can at least control the loop flow.
The subsetVector function determines whether vector a is a subset of vector b (returns 1) or vector b is a subset of vector a (returns 2); otherwise it returns 0. I then loop over all rows of the data.frame and remove the subset rows.
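subsetVector <- function(a, b){
  na_a <- which(is.na(a))
  na_b <- which(is.na(b))
  if(all(na_a %in% na_b)){
    # compare only columns where b is observed; a logical index stays
    # correct when b has no NAs (a[-integer(0)] would select nothing)
    keep <- !seq_along(b) %in% na_b
    if(all(a[keep] == b[keep])) return(2)
  }else if(all(na_b %in% na_a)){
    keep <- !seq_along(a) %in% na_a
    if(all(b[keep] == a[keep])) return(1)
  }
  return(0)
}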
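i <- 1
while(i < nrow(df)){
  remove_rows <- NULL
  for(j in (i+1):nrow(df)){
    p <- subsetVector(df[i,], df[j,])
    if(p == 1){
      # row i is a subset of row j: drop row i and stop scanning
      remove_rows <- c(remove_rows, i)
      break
    }else if(p == 2){
      # row j is a subset of row i: mark row j for removal
      remove_rows <- c(remove_rows, j)
    }
  }
  if(length(remove_rows) > 0)
    df <- df[-remove_rows, , drop = FALSE]
  # advance only if row i survived; otherwise a new row now sits at position i
  if(!i %in% remove_rows)
    i <- i + 1
}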
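For a quick sanity check of the pairwise rule on the question's example, here is a compact base-R sketch. It is not the loop above but a restatement of the same idea; `is_subset_row` is a hypothetical helper name, and the sketch assumes all columns share one type so that `as.matrix()` does not coerce values. It still compares every pair of rows, so it is no faster in the worst case.

```r
# Example data from the question
df <- data.frame(x = c(1, NA, 1), y = c(NA, 1, 1), z = c(0, NA, NA))

# Hypothetical helper: TRUE when every observed (non-NA) value of row a
# is also present, in the same column, in row b
is_subset_row <- function(a, b) {
  obs <- !is.na(a)
  all(!is.na(b[obs]) & a[obs] == b[obs])
}

m <- as.matrix(df)
n <- nrow(m)

# A row is redundant if it is a subset of some other row;
# among exact duplicates only the first occurrence is kept
redundant <- vapply(seq_len(n), function(i) {
  any(vapply(seq_len(n)[-i], function(j) {
    is_subset_row(m[i, ], m[j, ]) &&
      !(is_subset_row(m[j, ], m[i, ]) && i < j)
  }, logical(1)))
}, logical(1))

result <- df[!redundant, , drop = FALSE]
print(result)
```

On the example data only row 2 is flagged, leaving rows 1 and 3, which matches the desired output in the question.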
Upvotes: 1