arteteco

Reputation: 79

Remove rows if values exist with the same combination in different columns

I have 410 DNA sequences that I have compared with each other to get their pairwise similarity.

Now, to trim the dataset, I need to get rid of rows that hold the same pair of values in the two columns in swapped order, because every comparison appears twice.

To make myself clear, I have something like

library(tibble)

df <- tribble(
  ~seq01, ~seq02, ~similarity,
  "a",   "b", 100.000,
  "b",   "a", 100.000,
  "c",   "d", 99.000,
  "d",   "c", 99.000
)

Comparing a-b and b-a is the same thing, so I want to get rid of the duplicate row.

What I want to end up with is

tribble(
  ~seq01, ~seq02, ~similarity,
  "a",   "b", 100.000,
  "c",   "d", 99.000
)

I am not sure how to proceed; all the ways I thought of are kind of hacky. I checked other answers, but they don't really satisfy me.

Any input would be greatly appreciated (but tidy inputs are even more appreciated!)

Upvotes: 0

Views: 80

Answers (2)

Chris Ruehlemann

Reputation: 21400

Another, base R, approach:

df$add1 <- apply(df[, 1:2], 1, min)   # row-wise minimum of the two sequence names
df$add2 <- apply(df[, 1:2], 1, max)   # row-wise maximum of the two sequence names
df <- df[!duplicated(df[, 4:5]), ]    # keep the first row for each (min, max) pair
df[, 4:5] <- NULL                     # remove the auxiliary columns

Result:

df
# A tibble: 2 x 3
  seq01 seq02 similarity
  <chr> <chr>      <dbl>
1 a     b            100
2 c     d             99
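
For what it's worth, the same idea can probably be done without the auxiliary columns. The following is only a sketch on the toy data from the question, rebuilt here as a plain data frame; the names key and result are just illustrative. It sorts each pair within its row and drops rows whose sorted pair has already been seen:

```r
# Toy data from the question, as a plain data frame (assumed layout)
df <- data.frame(
  seq01 = c("a", "b", "c", "d"),
  seq02 = c("b", "a", "d", "c"),
  similarity = c(100, 100, 99, 99),
  stringsAsFactors = FALSE
)

# Sort each pair within its row; apply() returns the sorted pairs as columns,
# so transpose to get back one row per pair
key <- t(apply(df[, c("seq01", "seq02")], 1, sort))

# Keep only the first occurrence of each unordered pair
result <- df[!duplicated(key), ]
result
```

Like the version above, this keeps whichever orientation of a pair comes first in the data, and it ignores similarity when deciding what counts as a duplicate.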

Upvotes: 0

Ronak Shah

Reputation: 388982

We can use pmin and pmax to put each pair into a consistent order within its row (smaller value first), and then use distinct to keep only the unique rows.

library(dplyr)

df %>%
  mutate(col1 = pmin(seq01, seq02),
         col2 = pmax(seq01, seq02), .before = 1) %>%
  distinct(col1, col2, similarity)

#  col1  col2  similarity
#  <chr> <chr>      <dbl>
#1 a     b            100
#2 c     d             99  

Upvotes: 4
