Ryan
Ryan

Reputation: 1068

How to remove rows that have the same value in two columns and rows that are duplicates of previous observations, regardless of order

Consider the data frame Data created here:

Data <- data.frame(Location = rep(letters[1:20], each = 10))

I want to do pairwise comparisons manually. First, I want to find all possible pairwise combinations between the levels of Location in Data, so I make the object Pairs like this:

Pairs <-expand.grid(unique(Data$Location),unique(Data$Location))

Now I want to remove rows from the object Pairs where Pairs$Var1 == Pairs$Var2, and I want to remove rows that are "duplicates" of previous pairs, but in the opposite order. In other words, I want to remove rows where Pairs$Var1 == a & Pairs$Var2 == a, Pairs$Var1 == b & Pairs$Var2 == b, and so on (i.e., I don't want to compare Locations to themselves), and I also don't want the same comparison to be made twice, so if Var1==a has already been compared (or already exists earlier in the data.frame) to Var2 == b, then I don't want to compare Var1 == b to Var2 == a, so I need to remove one of these combinations, but not the other (I hope that makes sense). How can I do this?

We can use Pairs[Pairs$Var1 == Pairs$Var2,] to see where Var1 == Var2, but this doesn't help with the second problem

Upvotes: 1

Views: 3334

Answers (3)

user2110417
user2110417

Reputation:

Your dataframe:

df <- data.frame(Location = rep(letters[1:20], each = 10))
df <-expand.grid(unique(df$Location),unique(df$Location))
dim(df) # 400 rows x 2 columns

Remove the duplicates:

df = df[!duplicated(t(apply(df, 1, sort))), ]
dim(df) # 210 rows x 2 columns

Remove Var1 = Var2

df = subset(df, df$Var1 != df$Var2)
dim(df) # 190 rows x 2 columns

Upvotes: 0

Bas
Bas

Reputation: 4658

A neat trick to make sure that two columns are different and that there are no symmetrical duplicates is to use the 'greater than' or 'smaller than' operator.

Pairs <- Pairs[as.character(Pairs$Var1) < as.character(Pairs$Var2),]

or, if you want the inverse,

Pairs <- Pairs[as.character(Pairs$Var1) >= as.character(Pairs$Var2),]

This works because a < a is False (an item is not less than itself), and for every pair (a, b) either a < b or b < a is True, the other is False. That way, for every such pair you only keep one.

Therefore as.character(Pairs$Var1) < as.character(Pairs$Var2) returns a vector of True and False that you can use to slice your data.frame. The as.character() is needed because one cannot compare factors using <.

Upvotes: 3

carlo_sguera
carlo_sguera

Reputation: 395

I post a solution which uses a for:

First, remove rows with the same value in both columns:

Pairs <- Pairs[Pairs$Var1 != Pairs$Var2,]

Second, remove "standard" duplicates:

Pairs <- Pairs[!duplicated(Pairs),]

Finally, remove duplicates that are in opposite order. My strategy involves creating a temporary columns that allows you (a) not to make the search for cases that you know already are duplicates; (b) make the final filtering. Then, you can remove the temporary column:

Pairs$my_duplicated <- FALSE
for(i in 1:nrow(Pairs)){
  if(Pairs$my_duplicated[i] == FALSE){
    my_test <- Pairs$Var2 %in% Pairs$Var1[i] & Pairs$Var1 %in% Pairs$Var2[i]
    Pairs$my_duplicated[my_test] <- TRUE
  }
}
Pairs <- Pairs[!Pairs$my_duplicated,]
Pairs$my_duplicated <- NULL

Upvotes: 1

Related Questions