Rachel Rap

Reputation: 65

Remove duplicated groups with dplyr in R

I have the following data frame:

df <- structure(
  list(
    GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
    group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L)
  ),
  class = "data.frame", row.names = c(NA, -7L)
)

GENE  group
ENS1  1
ENS2  1
ENS3  1
ENS4  2
ENS1  3
ENS2  3
ENS3  3

Since groups 1 and 3 contain identical GENE values, I would like to remove one of them. How can I do that?

Thank you

Upvotes: 5

Views: 132

Answers (6)

GKi

Reputation: 39647

You can split df by group, keep only the list elements whose GENE vector is not duplicated, and rbind the result.

x <- unname(split(df, df$group))
do.call(rbind, x[!duplicated(lapply(x, `[[`, "GENE"))])
#  GENE group
#1 ENS1     1
#2 ENS2     1
#3 ENS3     1
#4 ENS4     2

If GENE is not guaranteed to be unique and sorted within each group, apply unique() and sort() to it first so that duplicated groups can still be detected.

x <- unname(split(df, df$group))
do.call(rbind, x[!duplicated(lapply(x, function(y) sort(unique(y$GENE))))])
#  GENE group
#1 ENS1     1
#2 ENS2     1
#3 ENS3     1
#4 ENS4     2
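
To illustrate why the sort(unique(...)) step matters, here is a small sketch using a hypothetical data frame df2 (not from the question) in which group 3 holds the same genes as group 1, but in a different order. Only the order-insensitive comparison should flag group 3 as a duplicate.

```r
# Hypothetical variant of df: group 3 lists the genes of group 1 in another order
df2 <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS3", "ENS1", "ENS2"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)
x <- unname(split(df2, df2$group))

# Order-sensitive comparison: group 3 is not recognised as a duplicate
plain <- duplicated(lapply(x, `[[`, "GENE"))
plain

# Order-insensitive comparison: group 3 is flagged
sorted <- duplicated(lapply(x, function(y) sort(unique(y$GENE))))
sorted
```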

Upvotes: 3

akrun

Reputation: 886938

Using base R

subset(df, !duplicated(GENE))
  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2
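
One caveat worth checking, depending on what "identical" is meant to cover: !duplicated(GENE) drops every repeated GENE value, whether or not the whole group is repeated. A minimal sketch, assuming a hypothetical df_partial in which group 3 holds only ENS1 and ENS2:

```r
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical variant: drop row 7 so that group 3 is only a subset of group 1
df_partial <- df[-7, ]

# Row-wise de-duplication on GENE alone
res <- subset(df_partial, !duplicated(GENE))
res
```

Group 3 disappears from the result even though it no longer matches group 1, so this shortcut seems to fit only when groups either match completely or do not overlap at all.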

Upvotes: 1

Ronak Shah

Reputation: 388797

You can create a unique key for each group by pasting its sorted GENE values together, keep only the first group for each distinct key, and then join back to the original df to recover the rows.

library(dplyr)

df %>%
  group_by(group) %>%
  summarise(key = toString(sort(GENE))) %>%
  distinct(key, .keep_all = TRUE) %>%
  left_join(df, by = 'group') %>%
  select(-key)

#  group GENE 
#  <int> <chr>
#1     1 ENS1 
#2     1 ENS2 
#3     1 ENS3 
#4     2 ENS4 

If you drop the 7th row of the data so that groups 1 and 3 are no longer identical, the rows for all groups are kept. I hope that is what you meant by "identical".

df <- df[-7, ]

df %>%
  group_by(group) %>%
  summarise(key = toString(sort(GENE))) %>%
  distinct(key, .keep_all = TRUE) %>%
  left_join(df, by = 'group') %>%
  select(-key)

#  group GENE 
#  <int> <chr>
#1     1 ENS1 
#2     1 ENS2 
#3     1 ENS3 
#4     2 ENS4 
#5     3 ENS1 
#6     3 ENS2 
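
For readers who want to inspect the intermediate keys, the same idea can be sketched in base R. This is only an illustration of the key-building step, not a replacement for the pipeline above; the names keys and keep are made up for the example.

```r
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)

# One key per group: the sorted GENE values collapsed into a single string
keys <- vapply(split(df$GENE, df$group),
               function(g) toString(sort(g)), character(1))
keys

# Groups whose key has not been seen before
keep <- names(keys)[!duplicated(keys)]

# Keep only the rows belonging to those groups
out <- df[df$group %in% keep, ]
out
```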

Upvotes: 3

TarJae

Reputation: 78917

We could use filter with !duplicated:

library(dplyr)

df %>%
  filter(!duplicated(GENE))

Output:

  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2

Upvotes: 3

ThomasIsCoding

Reputation: 101034

A base R option using stack + unstack + duplicated

setNames(
    type.convert(
        stack((u <- unstack(df))[!duplicated(u)]),
        as.is = TRUE
    ), names(df)
)

which gives

  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2
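
The one-liner is dense, so here is a sketch of the intermediate steps, assuming the two-column df from the question (for which unstack() falls back to the formula GENE ~ group). The names u, dup and s exist only for this illustration.

```r
df <- data.frame(
  GENE  = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS1", "ENS2", "ENS3"),
  group = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)

# unstack() turns the long data frame into a list of GENE vectors, one per group
u <- unstack(df)
str(u)

# duplicated() on a list compares whole elements, i.e. whole groups
dup <- duplicated(u)
dup

# stack() converts the remaining list back to long form (columns: values, ind)
s <- stack(u[!dup])
s
```

Note that the comparison is order-sensitive, so GENE would need to be sorted within each group beforehand if the order can differ.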

Upvotes: 4

bird

Reputation: 3294

library(dplyr)
distinct(df, GENE, .keep_all = TRUE)

Output:

  GENE group
1 ENS1     1
2 ENS2     1
3 ENS3     1
4 ENS4     2

Upvotes: 4
