Reputation: 27

How to select duplicate lines from a data frame?

I have something like this:

df <- data.frame('C1' = c("A","A","C","D","E","E","E"), 'C2' = c("B1","B2","B3","B1","B1","B2","B4"))

And I would like something like that:

C1  C2
A   B1
A   B2
E   B1
E   B2
E   B4

Then I would like something like that:

B1 B2 2  (because A and E)
B1 B4 1  (because only E)
B2 B4 1  (because only E)

Upvotes: 1

Answers (3)

arg0naut91

Reputation: 14764

You could try:

library(dplyr)

df %>% 
  full_join(df, by = 'C1') %>%
  filter(!C2.x == C2.y, 
         !gsub("\\D", "", C2.y) < gsub("\\D", "", C2.x)
         ) %>%
  count(Combination = paste(C2.x, C2.y))

Output:

# A tibble: 3 x 2
  Combination     n
  <chr>       <int>
1 B1 B2           2
2 B1 B4           1
3 B2 B4           1

Upvotes: 2

tmfmnk

Reputation: 39858

One dplyr and tidyr approach could be:

df %>%
 group_by(C1) %>%
 filter(n() >= 2) %>%
 mutate(C3 = list(combn(C2, 2, FUN = paste, collapse = ","))) %>%
 slice(1) %>%
 unnest(C3) %>%
 ungroup() %>%
 count(C3)

  C3        n
  <chr> <int>
1 B1,B2     2
2 B1,B4     1
3 B2,B4     1

First, it filters out cases from C1 with only one observation. Then, it combines every two elements from C2 per group. Finally, it counts the number of appearances per pair.

Considering there could be duplicated C2 values per group:

df %>%
 group_by(C1) %>%
 filter(n() >= 2) %>%
 mutate(C3 = list(combn(unique(C2), 2, FUN = paste, collapse = ",")))  %>%
 slice(1) %>%
 unnest(C3) %>%
 ungroup() %>%
 count(C3)

Upvotes: 3

MA-Lembarki

Reputation: 19

Try to use dplyr and data.table libraries

install.package("tidyverse")
install.package("data.table")

library(tidyverse)
library(data.table)


df2 = df %>%
  filter(C1 %in% c('A', 'E'))

Upvotes: -1

How to select duplicate lines from a data frame?

Answers (3)

Related Questions