Reputation: 109

Filter for rows with duplicate values in dplyr

Tibble:

df <- tibble(x = c(1, 3, 2, 5, 3, 7, 1, 9),
              y = c(rep("condition_one", 5), rep("condition_two", 3)))

Goal:

I want to return rows of 'non-unique' values in x AND for condition_one in y. In this example it would be row 2 and 5 because the value 3 is not unique in x and y is condition_one.

What I first tried was using the duplicated() function from base R:

df %>% 
  filter(duplicated(.$x) & y == "condition_one")

Output:

# A tibble: 1 x 2
      x y            
  <dbl> <chr>        
1     3 condition_one

Though it only returns the duplicated row and not the "original" one. What would be a solution in this case that I can also include in a dplyr pipe chain?

EDIT:

I see that some answers still don't solve my problem and I think it's because I phrased my question badly: In my filtered output, I only want rows where the value in x was not unique. But I want R to return ALL the non-unique rows, not only the one where the duplicate value first occured (which is happening with duplicate()). On top of that, I want y == "condition_one"

Upvotes: 1

Answers (5)

Jacob Nersting

Reputation: 1

With base R. It is not pretty, but it works and includes row 1, which I believe it should:

df <- tibble(row = 1:8,
           x = c(1, 3, 2, 5, 3, 7, 1, 9),
           y = c(rep("condition_one", 5), rep("condition_two", 3)))


selection <-   df[df$x %in% df$x[duplicated(df$x)] & df$y == "condition_one",]

or with larger data perhaps:

selection2 <- df[df$x %in% unique(df$x[duplicated(df$x)]) & df$y == "condition_one",]

Upvotes: 0

Faustin Gashakamba

Reputation: 171

I recently found out this package called campfin and it contains a function, flag_dupes() which does exactly what you want. In a nutshell, it adds a logical column it calls dupe_flag which you can use to filter out unique rows.

Upvotes: 0

eduardokapp

Reputation: 1751

I'm not sure I get if you want to find duplicates in the universe where df$y meets "condition_one" or the opposite. I'll write both cases.

In base R:

Case 1: Find all duplicates and then find which ones meet condition.

all_duplicates <- df[duplicated(df$x) | duplicated(df$x, fromLast = TRUE), ]
your_condition <- all_duplicates[all_duplicates$y == "condition_one", ]

Case 2: Find which ones meet condition and then find duplicates there.

df_cond <- df[df$y == "condition_one", ]

all_duplicates <- df_cond[duplicated(df_cond$x) | duplicated(df_cond$x, fromLast = TRUE), ]

Upvotes: 2

schoulten

Reputation: 68

Here is the solution:

df %>% 
  group_by(x) %>% 
  filter(y == "condition_one" & n() == 1) %>% 
  ungroup()

#Results:
# A tibble: 2 x 2
      x y            
  <dbl> <chr>        
1     2 condition_one
2     5 condition_one

Upvotes: 2

akrun

Reputation: 887118

Using dplyr

library(dplyr)
df %>%
    filter(duplicated(cur_data()) & y == 'condition_one')

Upvotes: 0

Filter for rows with duplicate values in dplyr

Answers (5)

Related Questions