Reputation: 10996
I have a dataframe like this:
df = data.frame(x = c("1_1_1", "2_1_1", "3_1_1"),
y = c("1_1_1", "1_1_1", "1_1_1"),
z = c("1_1_1", "4_1_1", "1_1_1"))
I now want to dynamically go through all my columns and check whether any two columns have the same value in a given row. So I want to compare x with y, x with z and y with z. Please note that in reality I have many more columns.
The desired outcome is to remove every row where there is at least one duplicated value, i.e. in my example I want to remove row 1 (because all values are identical) and row 3 (because y and z are identical).
Maybe another note: in reality my data frame has ~30m rows.
I know there are the duplicated
and anyDuplicated
functions, but AFAIK these check for duplicates across all columns at once, whereas I want the check to be based on pairwise column comparisons.
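For concreteness, a brute-force pairwise check over the small example above might look like the following sketch (this scans every column pair per row, so it is untested at the ~30m-row scale mentioned below):

```r
# Example data frame from the question
df <- data.frame(x = c("1_1_1", "2_1_1", "3_1_1"),
                 y = c("1_1_1", "1_1_1", "1_1_1"),
                 z = c("1_1_1", "4_1_1", "1_1_1"),
                 stringsAsFactors = FALSE)

# All unordered pairs of column names: (x,y), (x,z), (y,z)
pairs <- combn(names(df), 2, simplify = FALSE)

# A row is flagged if any pair of columns matches in that row
dup_row <- Reduce(`|`, lapply(pairs, function(p) df[[p[1]]] == df[[p[2]]]))

df[!dup_row, ]
#       x     y     z
# 2 2_1_1 1_1_1 4_1_1
```

Rows 1 and 3 are dropped, matching the desired outcome described above.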
EDIT: somewhat related to this problem (and not sure if it makes things easier): I create df
from a character vector like x = c("1_1_1", "2_1_1", "3_1_1", "1_2_1")
and so on, and then I use the expand.grid
function:
df = expand.grid(x, x, x)
which produces the duplicates in the first place. So ideally, I would create the
data frame in a way that prevents these duplicates right from the beginning?
Upvotes: 0
Views: 411
Reputation: 388907
To solve the first issue, where expand.grid
creates the duplicates, we can use combn
instead, which gives us the combinations without duplicates:
combn(x, 3, simplify = FALSE)
#[[1]]
#[1] "1_1_1" "2_1_1" "3_1_1"
#[[2]]
#[1] "1_1_1" "2_1_1" "1_2_1"
#[[3]]
#[1] "1_1_1" "3_1_1" "1_2_1"
#[[4]]
#[1] "2_1_1" "3_1_1" "1_2_1"
Now we need to create rows with the various permutations of each combination. We can either write a function to generate these permutations manually or use one from a package. Here I am using combinat::permn
.
do.call(rbind.data.frame, combn(x, 3, simplify = FALSE, FUN = function(y)
do.call(rbind, combinat::permn(y))))
# V1 V2 V3
#1 1_1_1 2_1_1 3_1_1
#2 1_1_1 3_1_1 2_1_1
#3 3_1_1 1_1_1 2_1_1
#4 3_1_1 2_1_1 1_1_1
#5 2_1_1 3_1_1 1_1_1
#6 2_1_1 1_1_1 3_1_1
#....
To remove rows that contain duplicated
values we can do
df[!apply(df, 1, function(x) any(duplicated(x))), ]
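Applied to the example df from the question, this keeps only the row where all three values differ:

```r
df <- data.frame(x = c("1_1_1", "2_1_1", "3_1_1"),
                 y = c("1_1_1", "1_1_1", "1_1_1"),
                 z = c("1_1_1", "4_1_1", "1_1_1"))

# apply() coerces each row to a character vector; duplicated() then
# flags any repeated value within that row
df[!apply(df, 1, function(x) any(duplicated(x))), ]
#       x     y     z
# 2 2_1_1 1_1_1 4_1_1
```

Note that apply() coerces the data frame to a character matrix row by row, which may be slow on ~30m rows; the combn-based generation above avoids creating such rows in the first place.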
Upvotes: 2