Reputation: 3461
After combining a number of large data.frames with bind_rows(), I end up with a data.frame like this:
tmp <- data.frame(Query=c("A", "B", "C", "D", "A"), target=c("D", "A", "A", "A", "B"), values=runif(5))
tmp
Query target values
1 A D 0.06075322
2 B A 0.43179750
3 C A 0.32325309
4 D A 0.26714620
5 A B 0.96854999
I need to remove all rows whose combination of Query and target has already appeared in an earlier row, in either direction (AxD is a duplicate of DxA). In the example, row 4 is a duplicate of row 1 and row 5 is a duplicate of row 2, so the desired output would be:
tmp
Query target values
1 A D 0.06075322
2 B A 0.43179750
3 C A 0.32325309
Thank you very much!
Upvotes: 2
Views: 330
Reputation: 887118
Using the vectorized pmin/pmax to put each pair into a consistent order before calling duplicated():
subset(tmp, !duplicated(cbind(pmin(Query, target), pmax(Query, target))))
Query target values
1 A D 0.06075322
2 B A 0.43179750
3 C A 0.32325309
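To see why this works, it may help to look at the intermediate key. The following is only a minimal sketch: it uses fixed values instead of runif() so the result is reproducible, the names key and dedup are illustrative, and it assumes Query and target are character columns (the default for data.frame() since R 4.0.0; factor columns would probably need as.character() first).

```r
# Same shape as the question's data, with fixed values for reproducibility
tmp <- data.frame(Query  = c("A", "B", "C", "D", "A"),
                  target = c("D", "A", "A", "A", "B"),
                  values = c(0.1, 0.2, 0.3, 0.4, 0.5),
                  stringsAsFactors = FALSE)

# pmin()/pmax() compare element-wise, so each row's pair is stored in one
# canonical order regardless of the direction it originally appeared in
key <- cbind(pmin(tmp$Query, tmp$target), pmax(tmp$Query, tmp$target))
key

# duplicated() on a matrix works row by row and flags only later repeats,
# so the first occurrence of every pair is kept
dedup <- tmp[!duplicated(key), ]
dedup
```

Here rows 1 and 4 both map to the key (A, D), and rows 2 and 5 both map to (A, B), so only rows 1 to 3 survive.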
Upvotes: 3
Reputation: 8880
A tidyverse approach:
tmp <- data.frame(Query=c("A", "B", "C", "D", "A"), target=c("D", "A", "A", "A", "B"), values=runif(5))
tmp
#> Query target values
#> 1 A D 0.4596637
#> 2 B A 0.1274885
#> 3 C A 0.2051829
#> 4 D A 0.4037819
#> 5 A B 0.1777751
library(tidyverse)
tmp %>%
  rowwise() %>%
  mutate(fltr = str_c(sort(c_across(c("Query", "target"))), collapse = "")) %>%
  distinct(fltr, .keep_all = TRUE) %>%
  select(-fltr) %>%
  ungroup()
#> # A tibble: 3 x 3
#> Query target values
#> <chr> <chr> <dbl>
#> 1 A D 0.460
#> 2 B A 0.127
#> 3 C A 0.205
Created on 2023-02-28 with reprex v2.0.2
Upvotes: 4
Reputation: 51994
Sort the selected columns within each row and discard the duplicated rows:
cols = c("Query", "target")
tmp[!duplicated(t(apply(tmp[cols], 1, sort))), ]
# Query target values
#1 A D 0.7205899
#2 B A 0.5484203
#3 C A 0.4503456
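The one-liner above can be unpacked step by step. This is a rough sketch with fixed values in place of runif(); sorted_pairs and result are illustrative names that do not appear in the original answer.

```r
# Fixed values so the example is reproducible
tmp <- data.frame(Query  = c("A", "B", "C", "D", "A"),
                  target = c("D", "A", "A", "A", "B"),
                  values = c(0.1, 0.2, 0.3, 0.4, 0.5),
                  stringsAsFactors = FALSE)
cols <- c("Query", "target")

# apply(..., 1, sort) sorts the values within each row, but returns the
# sorted rows as columns; t() turns them back into one row per input row
sorted_pairs <- t(apply(tmp[cols], 1, sort))
sorted_pairs

# Keep only the first row for each sorted pair
result <- tmp[!duplicated(sorted_pairs), ]
result
```

Because sort() handles vectors of any length, the same pattern should carry over to more than two key columns by extending cols, at the cost of a row-by-row apply() that may be slow on very large data.frames.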
Upvotes: 6