Reputation: 65
I have the following data frame:
df <- structure(list(
  GENE = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L)),
  class = "data.frame", row.names = c(NA, -7L))
GENE group
ENS1 1
ENS2 1
ENS3 1
ENS4 2
ENS1 3
ENS2 3
ENS3 3
Since groups 1 and 3 contain exactly the same genes, I would like to remove one of them. How can I do that?
Thank you
Upvotes: 5
Views: 132
Reputation: 39647
You can split df per group, select the list elements which are not duplicated on GENE, and rbind the result.
x <- unname(split(df, df$group))
do.call(rbind, x[!duplicated(lapply(x, `[[`, "GENE"))])
# GENE group
#1 ENS1 1
#2 ENS2 1
#3 ENS3 1
#4 ENS4 2
If GENE is not guaranteed to be unique and sorted within each group, deduplicate and sort it before comparing, so that groups holding the same genes are detected as duplicates.
x <- unname(split(df, df$group))
do.call(rbind, x[!duplicated(lapply(x, function(y) sort(unique(y$GENE))))])
# GENE group
#1 ENS1 1
#2 ENS2 1
#3 ENS3 1
#4 ENS4 2
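To see why the sorting matters, here is a small sketch on a made-up variant of the data. Both df2 and the helper name drop_duplicate_groups are my own illustrations, not part of the question: group 3 holds the same genes as group 1, but in a different order.

```r
# Hypothetical variant of the question's data: group 3 has the same genes
# as group 1, listed in a different order.
df2 <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS3", "ENS1", "ENS2"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L)
)

# Illustrative helper (the name is an assumption): keep the first group for
# each distinct gene set, optionally normalising the set before comparing.
drop_duplicate_groups <- function(d, normalise = TRUE) {
  x <- unname(split(d, d$group))
  sets <- lapply(x, function(y) if (normalise) sort(unique(y$GENE)) else y$GENE)
  do.call(rbind, x[!duplicated(sets)])
}

strict  <- drop_duplicate_groups(df2, normalise = FALSE)  # order-sensitive
relaxed <- drop_duplicate_groups(df2, normalise = TRUE)   # order-insensitive

unique(strict$group)   # all three groups survive the order-sensitive check
unique(relaxed$group)  # group 3 is dropped once the genes are sorted
```

Without the normalisation, group 3 is treated as different from group 1 purely because of row order.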
Upvotes: 3
Reputation: 886938
Using base R
subset(df, !duplicated(GENE))
GENE group
1 ENS1 1
2 ENS2 1
3 ENS3 1
4 ENS4 2
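Note that this deduplicates on GENE alone, so it matches the expected output here only because every gene of group 3 already appears in an earlier group. A quick check, assuming "identical" means whole groups with the same gene set: after dropping the 7th row, groups 1 and 3 differ, yet the rows of group 3 are still removed.

```r
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L)
)

# Group 3 now has only ENS1 and ENS2, so it is no longer identical to group 1.
partial <- df[-7, ]

# Row-wise deduplication on GENE still discards every row of group 3.
res <- subset(partial, !duplicated(GENE))
unique(res$group)
```

If partially overlapping groups must be kept, a group-wise comparison is needed instead.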
Upvotes: 1
Reputation: 388797
You can create a unique key for each group by pasting its sorted GENE values together, keep only the first group for each distinct key, and then join back to the original df to recover the rows.
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(key = toString(sort(GENE))) %>%
  distinct(key, .keep_all = TRUE) %>%
  left_join(df, by = 'group') %>%
  select(-key)
# group GENE
# <int> <chr>
#1 1 ENS1
#2 1 ENS2
#3 1 ENS3
#4 2 ENS4
If you drop the 7th row of the data, so that group 1 and group 3 are no longer identical, it keeps the rows of all groups. I hope that is what you meant by "identical".
df <- df[-7, ]
df %>%
  group_by(group) %>%
  summarise(key = toString(sort(GENE))) %>%
  distinct(key, .keep_all = TRUE) %>%
  left_join(df, by = 'group') %>%
  select(-key)
# group GENE
# <int> <chr>
#1 1 ENS1
#2 1 ENS2
#3 1 ENS3
#4 2 ENS4
#5 3 ENS1
#6 3 ENS2
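For comparison, the same key idea can be sketched in base R. This is only an illustration of the technique, not a drop-in replacement for the pipeline above; the names key, keep and result are my own.

```r
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L)
)

# One key string per group: its sorted genes pasted together.
key <- vapply(split(df$GENE, df$group),
              function(g) toString(sort(g)), character(1))

# Names of the groups whose key has not been seen in an earlier group.
keep <- names(key)[!duplicated(key)]

# Keep only the rows that belong to those groups.
result <- df[as.character(df$group) %in% keep, ]
result
```

Printing key shows directly which groups collapse to the same string, which can help when checking what counts as "identical".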
Upvotes: 3
Reputation: 78917
We could use filter with !duplicated:
library(dplyr)
df %>%
filter(!duplicated(GENE))
Output:
GENE group
1 ENS1 1
2 ENS2 1
3 ENS3 1
4 ENS4 2
Upvotes: 3
Reputation: 101034
A base R option using stack + unstack + duplicated
setNames(
  type.convert(
    stack((u <- unstack(df))[!duplicated(u)]),
    as.is = TRUE
  ), names(df)
)
which gives
GENE group
1 ENS1 1
2 ENS2 1
3 ENS3 1
4 ENS4 2
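To make the one-liner easier to follow, the intermediate steps can be inspected separately. This sketch relies on unstack() being called without a formula on a two-column data frame, in which case the first column (GENE) is split by the second (group); since the groups have unequal sizes, the result is a named list rather than a data frame.

```r
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L)
)

# Split GENE by group: a list with one character vector per group.
u <- unstack(df)
str(u)

# duplicated() compares the list elements as whole vectors.
dup <- duplicated(u)
dup

# stack() turns the remaining list back into a long data frame with
# columns `values` and `ind`, which the answer then renames and converts.
long <- stack(u[!dup])
long
```

Because whole vectors are compared, the same caveat about gene order within a group applies here as well.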
Upvotes: 4
Reputation: 3294
library(dplyr)
distinct(df, GENE, .keep_all = TRUE)
Output:
GENE group
1 ENS1 1
2 ENS2 1
3 ENS3 1
4 ENS4 2
Upvotes: 4