Reputation: 10996
I have a dataframe like this:
df = data.frame(x = c("1_1_1", "2_1_1", "3_1_1"),
y = c("1_1_1", "1_1_1", "1_1_1"),
z = c("1_1_1", "4_1_1", "1_1_1"))
I now want to dynamically go through all my columns and check whether any two columns have the same value in a given row. So I want to compare x with y, x with z and y with z. Please note that in reality I have many more columns.
The desired outcome is to remove every row where there is at least one duplicated value, i.e. in my example I want to remove row 1 (because all values are identical) and row 3 (because y and z are identical).
Maybe another note: in reality my data frame has ~30m rows.
I know there are the duplicated
and anyDuplicated
functions, but AFAIK these check for duplicates across all columns at once, whereas I want the check to be based on pairwise column comparisons.
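For concreteness, a brute-force pairwise check over the small example above might look like the following sketch (this scans every column pair per row, so it is untested at the ~30m-row scale mentioned below):

```r
# Example data frame from the question
df <- data.frame(x = c("1_1_1", "2_1_1", "3_1_1"),
                 y = c("1_1_1", "1_1_1", "1_1_1"),
                 z = c("1_1_1", "4_1_1", "1_1_1"),
                 stringsAsFactors = FALSE)

# All unordered pairs of column names: (x,y), (x,z), (y,z)
pairs <- combn(names(df), 2, simplify = FALSE)

# A row is flagged if any pair of columns matches in that row
dup_row <- Reduce(`|`, lapply(pairs, function(p) df[[p[1]]] == df[[p[2]]]))

df[!dup_row, ]
#       x     y     z
# 2 2_1_1 1_1_1 4_1_1
```

Rows 1 and 3 are dropped, matching the desired outcome described above.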
EDIT: somewhat related to this problem (and not sure if it makes things easier): I create df
from a character vector like x = c("1_1_1", "2_1_1", "3_1_1", "1_2_1")
and so on, and then I use the expand.grid
function:
df = expand.grid(x, x, x)
which produces the duplicates in the first place. So ideally, I would create the
data frame in a way that prevents these duplicates right from the beginning?
Upvotes: 0
Views: 411
Reputation: 388907
To solve the first issue, where expand.grid
creates the duplicates, we can use combn
instead, which gives us the combinations without duplicates:
combn(x, 3, simplify = FALSE)
#[[1]]
#[1] "1_1_1" "2_1_1" "3_1_1"
#[[2]]
#[1] "1_1_1" "2_1_1" "1_2_1"
#[[3]]
#[1] "1_1_1" "3_1_1" "1_2_1"
#[[4]]
#[1] "2_1_1" "3_1_1" "1_2_1"
Now we need to create rows with the various permutations of each combination. We can either write a function to generate these permutations manually or use one from a package. Here I am using combinat::permn
.
do.call(rbind.data.frame, combn(x, 3, simplify = FALSE, FUN = function(y)
do.call(rbind, combinat::permn(y))))
# V1 V2 V3
#1 1_1_1 2_1_1 3_1_1
#2 1_1_1 3_1_1 2_1_1
#3 3_1_1 1_1_1 2_1_1
#4 3_1_1 2_1_1 1_1_1
#5 2_1_1 3_1_1 1_1_1
#6 2_1_1 1_1_1 3_1_1
#....
To remove rows that contain duplicated
values we can do
df[!apply(df, 1, function(x) any(duplicated(x))), ]
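Applied to the example df from the question, this keeps only the row where all three values differ:

```r
df <- data.frame(x = c("1_1_1", "2_1_1", "3_1_1"),
                 y = c("1_1_1", "1_1_1", "1_1_1"),
                 z = c("1_1_1", "4_1_1", "1_1_1"))

# apply() coerces each row to a character vector; duplicated() then
# flags any repeated value within that row
df[!apply(df, 1, function(x) any(duplicated(x))), ]
#       x     y     z
# 2 2_1_1 1_1_1 4_1_1
```

Note that apply() coerces the data frame to a character matrix row by row, which may be slow on ~30m rows; the combn-based generation above avoids creating such rows in the first place.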
Upvotes: 2