R: remove duplicate rows with full overlap of non-missing variables

Question

Many previous questions cover ways to remove duplicate rows with missing values, but none deal with the following case. Example starting data:

df <- data.frame(x = c(1, NA, 1), y=c(NA, 1, 1), z=c(0, NA, NA))
print(df)

Desired output:

df2 <- data.frame(x = c(1, 1), y=c(NA, 1), z=c(0, NA))
print(df2)

Here the second row is removed because it is a perfect subset of row 3: its only non-missing value (y = 1) also appears in row 3. In the real application I want to remove every row whose non-missing values are all duplicated in another row, and keep the row with fewer missing values overall.

I thought this might be accomplished using dplyr and a rowwise application of distinct(), but to no avail. I could do this with a very slow for loop, but with hundreds of columns and thousands of rows this is a poor option.

chinsoon12 · Accepted Answer

Here is another option using data.table:

library(data.table)

# convert to long format, drop NAs, and count the non-missing values per row (cnt)
mDT <- melt(setDT(df)[, rn := .I], id.vars = "rn", na.rm = TRUE)[, cnt := .N, by = rn]

# self-join on (variable, value) and keep only matches between different rows
merged <- mDT[mDT, on = .(variable, value), {
    diffrow <- i.rn != x.rn
    .(irn = i.rn[diffrow], xrn = x.rn[diffrow], icnt = i.cnt[diffrow])
}]

# count the matches per row pair; a row is redundant when every one of its
# non-missing values is matched in a single other row
ix <- merged[, xcnt := .N, by = .(irn, xrn)][icnt == xcnt]$irn

# drop the redundant rows; filtering on rn keeps every row when ix is empty
df[!rn %in% ix]
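
To sanity-check the result on a small sample, a brute-force version of the same rule can be written in base R. This is only a sketch: `is_covered()` and `drop_covered_rows()` are hypothetical helper names, not part of data.table or of the answer above, and the double loop is quadratic in the number of rows, so it is meant for verification rather than for the full data.

```r
# Hypothetical helper: TRUE if every non-missing value of row a appears,
# in the same column, in row b.
is_covered <- function(a, b) {
  keep <- !is.na(a)
  all(!is.na(b[keep]) & a[keep] == b[keep])
}

# Hypothetical helper: drop each row that is covered by some other row.
# Assumes all columns share one type, so as.matrix() does not coerce values.
drop_covered_rows <- function(d) {
  m <- as.matrix(d)
  n <- nrow(m)
  nonmiss <- rowSums(!is.na(m))  # non-missing values per row
  drop <- logical(n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) next
      # drop row i if row j covers it and either holds more information
      # or is an exact duplicate that appears earlier
      if (is_covered(m[i, ], m[j, ]) &&
          (nonmiss[i] < nonmiss[j] || i > j)) {
        drop[i] <- TRUE
        break
      }
    }
  }
  d[!drop, , drop = FALSE]
}

df <- data.frame(x = c(1, NA, 1), y = c(NA, 1, 1), z = c(0, NA, NA))
res <- drop_covered_rows(df)
print(res)  # keeps rows 1 and 3 of the example
```

The sketch keeps the first of two exactly identical rows. In the self-join, exact duplicates match each other, so it may be worth calling unique() on the data first if both copies should not be dropped.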
