Reputation: 3432
I have a dataframe such as :
Groups Subgroups Species Anothercolum
G1 1 SP1 jkl
G1 1 SP2 ds
G1 2 SP1 9DS
G1 2 SP2 JD
G1 3 SP1 Js2_0
G1 3 SP2 KUIO
G2 1 SP4 JUI
G2 1 SP5 NA
G2 1 SP6 NA
G2 2 SP4 900
G2 2 SP5 LO
G2 2 SP7 dez
G2 3 SP4 dez
G2 3 SP5 ge
G2 3 SP7 gez
And I would like to remove duplicate elements within Groups
and Subgroups
but only when there is the same content of Species
.
For instance, for G1
:
Groups Subgroups Species
G1 1 SP1
G1 1 SP2
G1 2 SP1
G1 2 SP2
G1 3 SP1
G1 3 SP2
I keep the first Subgroups1
since SP1 and SP2 are both in Subgroup1
and Subgroup2
;
Groups Subgroups Species
G1 1 SP1
G1 1 SP2
But for G2 :
Groups Subgroups Species
G2 1 SP4
G2 1 SP5
G2 1 SP6
G2 2 SP4
G2 2 SP5
G2 2 SP7
G2 3 SP4
G2 3 SP5
G2 3 SP7
The Subgroup1
does not have the same 3 species as subgroup2
and 3
, so I keep that one, but remove the subgroup3
and keep subgroup2
since they have the same species.
I should then get the following expected output :
Groups Subgroups Species Anothercolum
G1 1 SP1 jkl
G1 1 SP2 ds
G2 1 SP4 JUI
G2 1 SP5 NA
G2 1 SP6 NA
G2 2 SP4 900
G2 2 SP5 LO
G2 2 SP7 dez
Here is the dataframe if it can help:
structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G1",
"G2", "G2", "G2", "G2", "G2", "G2", "G2", "G2", "G2"), Subgroups = c(1L,
1L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), Species = c("SP1",
"SP2", "SP1", "SP2", "SP1", "SP2", "SP4", "SP5", "SP6", "SP4",
"SP5", "SP7", "SP4", "SP5", "SP7")), class = "data.frame", row.names = c(NA,
-15L))
Upvotes: 1
Views: 53
Reputation: 8844
You can use duplicated
to find duplicates even for a list. Something like this
library(dplyr)
library(tidyr)
df %>%
group_by(Groups, Subgroups) %>%
summarize(across(c(Species, Anothercolum), list), .groups = "drop") %>%
filter(!duplicated(Species)) %>%
unnest(c(Species, Anothercolum))
Output
# A tibble: 8 x 4
Groups Subgroups Species Anothercolum
<chr> <int> <chr> <chr>
1 G1 1 SP1 jkl
2 G1 1 SP2 ds
3 G2 1 SP4 JUI
4 G2 1 SP5 NA
5 G2 1 SP6 NA
6 G2 2 SP4 900
7 G2 2 SP5 LO
8 G2 2 SP7 dez
Another (more efficient) version using paste0
instead of list
since you can find this warning from ?duplicated
Warning
Using this for lists is potentially slow, especially if the elements are not atomic vectors (see vector) or differ only in their attributes. In the worst case it is O(n^2).
library(dplyr)
library(tidyr)
df %>%
group_by(Groups, Subgroups) %>%
summarize(across(c(Species, Anothercolum), paste0, collapse = ","), .groups = "drop") %>%
filter(!duplicated(Species)) %>%
separate_rows(Species, Anothercolum, sep = ",", convert = TRUE)
For this one, you will have to be careful about the separator used. However, both should be fine if you do not have a large dataset.
Upvotes: 3
Reputation: 101034
A base R option
merge(
df,
subset(
aggregate(
Species ~ .,
df,
c
),
as.logical(ave(Species, Groups, FUN = Negate(duplicated))),
select = c(Groups, Subgroups)
)
)
gives
Groups Subgroups Species
1 G1 1 SP1
2 G1 1 SP2
3 G2 1 SP5
4 G2 1 SP4
5 G2 1 SP6
6 G2 2 SP7
7 G2 2 SP4
8 G2 2 SP5
Upvotes: 3