Reputation: 79
I have 410 DNA sequences that I have compared with each other to get their pairwise similarity.
Now, to trim the dataset, I need to get rid of rows that contain the same pair of values in the first two columns in swapped order, because every comparison appears twice.
To make myself clear, I have something like:
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"b", "a", 100.000,
"c", "d", 99.000,
"d", "c", 99.000,
)
Comparing a-b and b-a is the same thing, so I want to get rid of the duplicated comparison.
What I want to end up with is:
tribble(
~seq01, ~seq02, ~ similarity,
"a", "b", 100.000,
"c", "d", 99.000
)
I am not sure how to proceed; all the ways I have thought of are kind of hacky. I checked other answers, but they don't really satisfy me.
Any input would be greatly appreciated (but tidy inputs are even more appreciated!)
Upvotes: 0
Views: 80
Reputation: 21400
Another approach, in base R:
df$add1 <- apply(df[,1:2], 1, min) # find rowwise minimum values
df$add2 <- apply(df[,1:2], 1, max) # find rowwise maximum values
df <- df[!duplicated(df[,4:5]),] # remove rows with identical values in new col's
df[,4:5] <- NULL # remove auxiliary col's
Result:
df
# A tibble: 2 x 3
seq01 seq02 similarity
<chr> <chr> <dbl>
1 a b 100
2 c d 99
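For larger inputs, the same idea can probably be written without the row-wise apply calls, since pmin and pmax are vectorised and also work on character vectors. A minimal sketch, assuming the two ID columns are character (not factor); the names lo, hi and res are only illustrative:

```r
# Example data, same shape as in the question
df <- data.frame(
  seq01 = c("a", "b", "c", "d"),
  seq02 = c("b", "a", "d", "c"),
  similarity = c(100, 100, 99, 99),
  stringsAsFactors = FALSE
)

# Order-independent key: the smaller ID goes in lo, the larger in hi
lo <- pmin(df$seq01, df$seq02)
hi <- pmax(df$seq01, df$seq02)

# Keep only the first row seen for each unordered pair
res <- df[!duplicated(data.frame(lo, hi)), ]
res
```

This leaves df itself unchanged, so no auxiliary columns have to be removed afterwards.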
Upvotes: 0
Reputation: 388982
We can use pmin and pmax to put the two values of each row in a consistent order, and then use distinct to select the unique rows.
library(dplyr)
df %>%
mutate(col1 = pmin(seq01, seq02),
col2 = pmax(seq01, seq02), .before = 1) %>%
distinct(col1, col2, similarity)
# col1 col2 similarity
# <chr> <chr> <dbl>
#1 a b 100
#2 c d 99
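If the original column names (and the original orientation of the first row in each pair) should be kept, a variation along these lines might work. It is only a sketch: key1 and key2 are made-up helper names, and it assumes that keeping the first occurrence of each pair is acceptable.

```r
library(dplyr)
library(tibble)

# Example data from the question
df <- tribble(
  ~seq01, ~seq02, ~similarity,
  "a", "b", 100,
  "b", "a", 100,
  "c", "d", 99,
  "d", "c", 99
)

dedup <- df %>%
  # Build an order-independent key for each pair
  mutate(key1 = pmin(seq01, seq02),
         key2 = pmax(seq01, seq02)) %>%
  # Keep the first row per key, retaining all other columns
  distinct(key1, key2, .keep_all = TRUE) %>%
  # Drop the helper columns again
  select(-key1, -key2)

dedup
```

Unlike distinct(col1, col2, similarity), this deduplicates on the pair only, so two rows for the same pair with different similarity values would also collapse to one.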
Upvotes: 4