Simon

Reputation: 25

Remove duplicates of concatenated values without regard to order in R

Let's say I have a character column in a data frame that contains every ordered pair of a, b and c, like so:

dat <- data.frame(V1 = c("a_a","a_b","a_c","b_a","b_b","b_c","c_a","c_b","c_c"))

However, I do not care about order, so I would like to remove the duplicates b_a, c_a and c_b, as I already have a_b, a_c and b_c. The desired result is:

dat <- data.frame(V1 = c("a_a","a_b","a_c","b_b","b_c","c_c"))

I usually use dplyr for data wrangling purposes, but I fail to see how dplyr::distinct() could achieve this.

I am of course happy to consider any (non-dplyr) solution. Thanks!

Upvotes: 0

Views: 381

Answers (3)

steveb

Reputation: 5532

You could do the following using dplyr and stringr:

library(dplyr)

dat %>%
  mutate(newval = unlist(
                    lapply(stringr::str_split(V1, "_"),
                           function(x) paste(sort(x), collapse = "_")))) %>%
  group_by(newval) %>%
  summarise()

## # A tibble: 6 x 1
##   newval
##   <chr> 
## 1 a_a   
## 2 a_b   
## 3 a_c   
## 4 b_b   
## 5 b_c   
## 6 c_c   

EDIT

Here is a simpler version in which unlist(lapply(...)) is replaced with sapply:

dat %>%
  mutate(newval = sapply(stringr::str_split(V1, "_"),
                         function(x) paste(sort(x), collapse = "_"))) %>%
  group_by(newval) %>%
  summarise()
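
Since the question asks about dplyr::distinct(), here is a minimal sketch of one way it could be used, assuming you want to keep the original V1 values rather than the sorted keys. The column name key is just a hypothetical helper, and the first occurrence of each pair is the one that survives.

```r
library(dplyr)

dat <- data.frame(V1 = c("a_a","a_b","a_c","b_a","b_b","b_c","c_a","c_b","c_c"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  # 'key' is a hypothetical helper column: the parts of V1 in sorted order
  mutate(key = sapply(strsplit(V1, "_", fixed = TRUE),
                      function(x) paste(sort(x), collapse = "_"))) %>%
  # keep the first row for each key, along with all other columns
  distinct(key, .keep_all = TRUE) %>%
  # drop the helper column again
  select(-key)

res$V1
# [1] "a_a" "a_b" "a_c" "b_b" "b_c" "c_c"
```

With .keep_all = TRUE, distinct() keeps the first row it sees for each key, so a_b is kept and b_a is dropped.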

Upvotes: 0

moodymudskipper

Reputation: 47300

If every combination has a duplicate and one member of each pair is always sorted, you can simply keep the entries that are not strictly sorted (note that this keeps b_a rather than a_b):

dat[sapply(strsplit(as.character(dat$V1), "_"), is.unsorted, strictly = TRUE), , drop = FALSE]
#    V1
# 1 a_a
# 4 b_a
# 5 b_b
# 7 c_a
# 8 c_b
# 9 c_c

More general:

dat[!duplicated(sapply(strsplit(as.character(dat$V1), "_"),
                       function(x) paste(sort(x), collapse = "_"))), , drop = FALSE]
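
A note on the separator used when rebuilding the key: with single-letter parts it makes no difference, but once the parts can be longer than one character, an empty collapse string can merge distinct pairs into the same key. A minimal sketch with made-up values (ab_c and a_bc are not from the question):

```r
# Hypothetical values whose parts are longer than one character
x <- c("ab_c", "a_bc")
parts <- strsplit(x, "_", fixed = TRUE)

# Rebuild the sorted key with and without a separator
key_empty <- sapply(parts, function(p) paste(sort(p), collapse = ""))
key_sep   <- sapply(parts, function(p) paste(sort(p), collapse = "_"))

duplicated(key_empty)  # both keys are "abc", so the second is flagged
# [1] FALSE  TRUE
duplicated(key_sep)    # "ab_c" and "a_bc" stay distinct
# [1] FALSE FALSE
```

Collapsing with the same separator that was used for splitting avoids this kind of collision.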

Upvotes: 1

r2evans

Reputation: 160417

You need two things: a function that does the internal sorting of _-separated things; and the ability to remove duplicates.

First:

internalsort <- function(x, split="_") {
  x <- as.character(x)
  sapply(lapply(strsplit(x, split=split), sort), paste, collapse=split)
}
rbind(as.character(dat$V1), internalsort(dat$V1))
#      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] 
# [1,] "a_a" "a_b" "a_c" "b_a" "b_b" "b_c" "c_a" "c_b" "c_c"
# [2,] "a_a" "a_b" "a_c" "a_b" "b_b" "b_c" "a_c" "b_c" "c_c"

where the second row is the internally sorted version of the first row.

Second, you need to find duplicates, with duplicated. Obviously, without the internal sort it finds no dupes:

duplicated(dat$V1)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

but now ...

duplicated(internalsort(dat$V1))
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE

So, applied to your data:

dat[! duplicated(internalsort(dat$V1)),,drop=FALSE]
#    V1
# 1 a_a
# 2 a_b
# 3 a_c
# 5 b_b
# 6 b_c
# 9 c_c
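
If every value is known to have exactly two parts, the internal sort can also be sketched without a per-element sort(), using pmin() and pmax() on the two halves. This is only an alternative under that assumption, and the names parts and key are just illustrative:

```r
dat <- data.frame(V1 = c("a_a","a_b","a_c","b_a","b_b","b_c","c_a","c_b","c_c"),
                  stringsAsFactors = FALSE)

# Split into a two-column character matrix (assumes exactly two parts per value)
parts <- do.call(rbind, strsplit(dat$V1, "_", fixed = TRUE))

# Order-independent key: smaller part first, larger part second
key <- paste(pmin(parts[, 1], parts[, 2]),
             pmax(parts[, 1], parts[, 2]),
             sep = "_")

res <- dat[!duplicated(key), , drop = FALSE]
res$V1
# [1] "a_a" "a_b" "a_c" "b_b" "b_c" "c_c"
```

pmin() and pmax() are vectorised over the whole column, which may help on large inputs, but for values with more than two parts the sort()-based function above is the more general choice.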

Upvotes: 0
