Reputation: 25
Let's say I have a character column in a data frame that contains every pairwise combination of a, b and c, like so:
dat <- data.frame(V1 = c("a_a","a_b","a_c","b_a","b_b","b_c","c_a","c_b","c_c"))
However, I do not care about order, so I would like to remove the duplicates b_a, c_a and c_b, as I already have a_b, a_c and b_c. The desired result is:
dat <- data.frame(V1 = c("a_a","a_b","a_c","b_b","b_c","c_c"))
I usually use dplyr for data wrangling purposes, but I fail to see how dplyr::distinct() could achieve this.
I am of course happy to consider any (non-dplyr) solution. Thanks!
Upvotes: 0
Views: 381
Reputation: 5532
You could do the following using dplyr and stringr:
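library(dplyr)

dat %>%
  # sort the two parts of each value so that b_a and a_b get the same key
  mutate(newval = unlist(
    lapply(stringr::str_split(V1, "_"),
           function(x) paste(sort(x), collapse = "_")))) %>%
  # one row per distinct key
  group_by(newval) %>%
  summarise()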
## # A tibble: 6 x 1
## newval
## <chr>
## 1 a_a
## 2 a_b
## 3 a_c
## 4 b_b
## 5 b_c
## 6 c_c
EDIT
Here is a simpler version where unlist(lapply(...)) is replaced with sapply:
dat %>%
  mutate(newval = sapply(stringr::str_split(V1, "_"),
                         function(x) paste(sort(x), collapse = "_"))) %>%
  group_by(newval) %>%
  summarise()
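Note that the pipelines above return the sorted key (newval) rather than the original V1 values. If the original values should be kept, one possible sketch uses distinct() with .keep_all = TRUE. The column name key is a hypothetical helper, and stringsAsFactors = FALSE is set explicitly so that the example behaves the same on older R versions:

```r
library(dplyr)

dat <- data.frame(V1 = c("a_a","a_b","a_c","b_a","b_b","b_c","c_a","c_b","c_c"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  # key (hypothetical helper column): the parts of V1 in sorted order
  mutate(key = sapply(strsplit(V1, "_"),
                      function(x) paste(sort(x), collapse = "_"))) %>%
  # keep the first row seen for each key, along with its original V1
  distinct(key, .keep_all = TRUE) %>%
  # drop the helper column again
  select(-key)

res$V1
```

Because distinct() keeps the first row it encounters for each key, which spelling survives (a_b or b_a) depends on the row order of the input.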
Upvotes: 0
Reputation: 47300
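If every combination has a duplicate and one of the two is always sorted, you can simply keep the values that are not strictly sorted. Note that this retains b_a, c_a and c_b rather than a_b, a_c and b_c:
dat[sapply(strsplit(as.character(dat$V1), "_"), is.unsorted, strictly = TRUE), , drop = FALSE]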
# V1
# 1 a_a
# 4 b_a
# 5 b_b
# 7 c_a
# 8 c_b
# 9 c_c
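If the sorted spellings (a_b, a_c, b_c) are the ones to keep, as in the question, a variant is to negate a non-strict is.unsorted() check instead. This is only a sketch under the same assumption that each unordered pair occurs at least once in sorted order; string comparison also depends on the locale, which should not matter for plain lowercase letters:

```r
dat <- data.frame(V1 = c("a_a","a_b","a_c","b_a","b_b","b_c","c_a","c_b","c_c"),
                  stringsAsFactors = FALSE)

# split each value into its parts
parts <- strsplit(as.character(dat$V1), "_")

# TRUE where the parts are already in non-decreasing order (ties such as a_a count as sorted)
keep <- !sapply(parts, is.unsorted)

dat[keep, , drop = FALSE]
```

This keeps rows 1, 2, 3, 5, 6 and 9.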
More generally:
dat[!duplicated(sapply(strsplit(as.character(dat$V1), "_"),
                       function(x) paste(sort(x), collapse = ""))), , drop = FALSE]
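One caveat with building the key using an empty collapse string: it is safe for single-character parts like the ones here, but parts of different lengths could collide. A small sketch with made-up values (key() is a hypothetical helper, not part of the answer):

```r
# hypothetical helper: sort the parts of each value and paste them back together
key <- function(x, collapse) {
  sapply(strsplit(x, "_"), function(p) paste(sort(p), collapse = collapse))
}

vals <- c("ab_c", "a_bc")   # made-up values, two different pairs

key(vals, "")    # both keys become "abc", so duplicated() would flag the second
key(vals, "_")   # keys stay distinct: "ab_c" and "a_bc"
```

Reusing the separator as the collapse string avoids this kind of collision.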
Upvotes: 1
Reputation: 160417
You need two things: a function that sorts the parts within each _-separated value, and a way to remove duplicates.
First:
internalsort <- function(x, split="_") {
  # coerce once, so factor columns work too
  x <- as.character(x)
  # split each value, sort its parts, and paste them back together
  sapply(lapply(strsplit(x, split=split), sort), paste, collapse=split)
}
rbind(as.character(dat$V1), internalsort(dat$V1))
#      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
# [1,] "a_a" "a_b" "a_c" "b_a" "b_b" "b_c" "c_a" "c_b" "c_c"
# [2,] "a_a" "a_b" "a_c" "a_b" "b_b" "b_c" "a_c" "b_c" "c_c"
where the second row is the internally sorted version of the first row.
Second, you need to find duplicates, with duplicated. Obviously, without the internal sort it finds no dupes:
duplicated(dat$V1)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
but after the internal sort it finds three:
duplicated(internalsort(dat$V1))
# [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
So, applied to your data:
dat[! duplicated(internalsort(dat$V1)),,drop=FALSE]
# V1
# 1 a_a
# 2 a_b
# 3 a_c
# 5 b_b
# 6 b_c
# 9 c_c
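One thing to watch with a helper like internalsort: strsplit() treats split as a regular expression by default, so a separator such as "." would match every character. A minimal sketch of a variant that takes the separator literally (internalsort_fixed is a hypothetical name, and the sample values are made up):

```r
# hypothetical variant: treat the separator as a literal string, not a regex
internalsort_fixed <- function(x, split = "_") {
  parts <- strsplit(as.character(x), split = split, fixed = TRUE)
  # sort the parts of each value and paste them back with the same separator
  vapply(parts, function(p) paste(sort(p), collapse = split), character(1))
}

x <- c("b.a", "a.b", "c.a")          # made-up values using "." as separator
sorted <- internalsort_fixed(x, split = ".")
sorted                               # "a.b" "a.b" "a.c"
x[!duplicated(sorted)]               # "b.a" "c.a"
```

vapply() is used here instead of sapply() only so that the return type is always a character vector, including for empty input.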
Upvotes: 0