copamundial
copamundial

Reputation: 213

find duplicates with grouped variables

I have a df that looks like this:

I guess it will work some with dplyr and duplicates. Yet I don't know how to address multiple columns while distinguishing between a grouped variable.

from  to  group

1     2   metro
2     4   metro
3     4   metro
4     5   train
6     1   train
8     7   train

I want to find the ids which exist in more than one group variable.

The expected result for the sample df is: 1 and 4. Because they exist in the metro and the train group.

Thank you in advance!

Upvotes: 1

Views: 1208

Answers (3)

IceCreamToucan
IceCreamToucan

Reputation: 28705

Convert data to long format and count unique values, using data.table. melt is used to convert to long format, and data table allows filtering in the i part of df1[ i, j, k], grouping in the k part, and pulling in the j part.

library(data.table)
library(magrittr)
setDT(df1)

melt(df1, 'group') %>% 
  .[, .(n = uniqueN(group)), value] %>% 
  .[n > 1, unique(value)]

# [1] 1 4

Upvotes: 1

akrun
akrun

Reputation: 887651

We gather the 'from', 'to' columns to 'long' format, grouped by 'val', filter the groups having more than one unique elements, then pull the unique 'val' elements

library(dplyr)
library(tidyr)
df1 %>% 
   gather(key, val, from:to) %>% 
   group_by(val) %>% 
   filter(n_distinct(group) > 1) %>%
   distinct(val) %>%
   pull(val)
#[1] 1 4

Or using base R we can just table to find the frequency, and get the ids out of it

out <-  with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"

data

df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L, 
 4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train", 
 "train", "train")), class = "data.frame", row.names = c(NA, -6L
 ))

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389175

Using base R we can split the first two columns based on group and find the intersecting value between the groups using intersect

Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4

Upvotes: 4

Related Questions