Reputation: 65

Remove duplicated group dplyr r

I have the following data frame:

df <- structure(list(GENE= c("ENS1", "ENS2", 
"ENS3", "ENS4", "ENS1",  "ENS2", "ENS3"), group= c(1L, 
1L, 1L, 2L, 3L, 3L, 3L)), 
class = "data.frame", row.names = c(NA, -7L))

GENE  group
ENS1  1
ENS2  1
ENS3  1
ENS4  2
ENS1  3
ENS2  3
ENS3  3

Since groups 1 and 3 contain exactly the same GENE values, I would like to keep only one of them. How can I do that?

Thank you

Upvotes: 5

Answers (6)

GKi

Reputation: 39647

You can split df by group, keep the list elements whose GENE vectors are not duplicated, and rbind the result.

x <- unname(split(df, df$group))
do.call(rbind, x[!duplicated(lapply(x, `[[`, "GENE"))])
#  GENE group
#1 ENS1     1
#2 ENS2     1
#3 ENS3     1
#4 ENS4     2

If GENE is not already unique and sorted within each group, apply unique() and sort() to it first; otherwise groups holding the same genes in a different order will not be detected as duplicates.

x <- unname(split(df, df$group))
do.call(rbind, x[!duplicated(lapply(x, function(y) sort(unique(y$GENE))))])
#  GENE group
#1 ENS1     1
#2 ENS2     1
#3 ENS3     1
#4 ENS4     2
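
To check that the second variant really ignores row order, here is a minimal sketch that wraps it in a helper and runs it on a copy of the data with group 3 reversed. The helper name drop_duplicate_groups and the object shuffled are made up for this illustration and are not part of the answer above.

```r
# Example data from the question.
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the first group for each distinct set of GENE values.
drop_duplicate_groups <- function(d) {
  parts <- unname(split(d, d$group))
  # One key per group that ignores row order and repeated genes.
  keys <- lapply(parts, function(p) sort(unique(p$GENE)))
  out <- do.call(rbind, parts[!duplicated(keys)])
  rownames(out) <- NULL
  out
}

# Reverse the rows of group 3 so that its GENE order differs from group 1.
shuffled <- df[c(1:4, 7:5), ]
res <- drop_duplicate_groups(shuffled)
print(res)
```

Group 3 is still dropped here because the keys are compared after sorting, not in row order.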

Upvotes: 3

akrun

Reputation: 886938

Using base R

subset(df, !duplicated(GENE))
  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2
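
Note that duplicated(GENE) compares rows, not whole groups. It gives the desired result on the example data, but it may differ from a group-level comparison when two groups overlap only partially. A small sketch, assuming row 7 is dropped so that group 3 holds only ENS1 and ENS2 (the names df2, row_level and group_level are illustrative):

```r
# Example data from the question, minus row 7.
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)
df2 <- df[-7, ]  # group 3 is now ENS1, ENS2 and no longer identical to group 1

# Row-level: drops every repeated GENE, so group 3 vanishes entirely.
row_level <- subset(df2, !duplicated(GENE))
print(unique(row_level$group))

# Group-level: compare the sorted gene set of each group instead.
parts <- split(df2, df2$group)
keys  <- lapply(parts, function(p) sort(unique(p$GENE)))
keep  <- names(parts)[!duplicated(keys)]
group_level <- df2[as.character(df2$group) %in% keep, ]
print(unique(group_level$group))
```

Which behaviour is wanted depends on what "identical groups" means for the data at hand.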

Upvotes: 1

Ronak Shah

Reputation: 388797

You can create a unique key for each group by pasting its sorted GENE values together, keep only the groups with distinct keys, and then join back to the original df to recover the rows.

library(dplyr)

df %>%
  group_by(group) %>%
  summarise(key = toString(sort(GENE))) %>%
  distinct(key, .keep_all = TRUE) %>%
  left_join(df, by = 'group') %>%
  select(-key)

df

#  group GENE 
#  <int> <chr>
#1     1 ENS1 
#2     1 ENS2 
#3     1 ENS3 
#4     2 ENS4

If you drop the 7th row of the data so that groups 1 and 3 are no longer identical, the rows of all groups are kept. I hope that is what you meant by "identical".

df <- df[-7, ]

df %>%
  group_by(group) %>%
  summarise(key = toString(sort(GENE))) %>%
  distinct(key, .keep_all = TRUE) %>%
  left_join(df, by = 'group') %>%
  select(-key)

#  group GENE 
#  <int> <chr>
#1     1 ENS1 
#2     1 ENS2 
#3     1 ENS3 
#4     2 ENS4 
#5     3 ENS1 
#6     3 ENS2

Upvotes: 3

TarJae

Reputation: 78917

We could use filter with !duplicated:

library(dplyr)

df %>%
  filter(!duplicated(GENE))

Output:

  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2

Upvotes: 3

ThomasIsCoding

Reputation: 101034

A base R option using stack + unstack + duplicated

setNames(
    type.convert(
        stack((u <- unstack(df))[!duplicated(u)]),
        as.is = TRUE
    ), names(df)
)

which gives

  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2
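
For readers who find the one-liner dense, the same steps can be sketched one at a time. The intermediate names u, kept and long are only for illustration.

```r
# Example data from the question.
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Without a formula, unstack() derives GENE ~ group from the column order
# and returns one GENE vector per group (a list here, as group sizes differ).
u <- unstack(df)
str(u)

# Drop list elements that repeat an earlier GENE vector.
kept <- u[!duplicated(u)]

# Back to long format: columns "values" and "ind" (the group labels as a factor).
long <- stack(kept)

# Convert the group labels back from a factor and restore the column names.
res <- setNames(type.convert(long, as.is = TRUE), names(df))
print(res)
```

One caveat, as far as the documentation of unstack() goes: when all groups have the same length the result is coerced to a data frame rather than a list, in which case duplicated() works row-wise and this approach would need adjusting.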

Upvotes: 4

bird

Reputation: 3294

library(dplyr)
distinct(df, GENE, .keep_all = TRUE)

Output:

  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2

Upvotes: 4
