Bradley Allf

Reputation: 155

Unique case of finding duplicate values flexibly across columns in R

I have a dataset similar to the following:

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"))

> df
  animal_1 predation_type animal_2
1      cat           eats    mouse
2      dog           eats squirrel
3    mouse       eaten by      cat
4 squirrel           eats     nuts

I am looking for code that identifies rows 1 and 3 as duplicates, since they describe the same phenomenon (a cat eating a mouse, or a mouse being eaten by a cat). I'm not even sure what this kind of duplicate is called, so I'm hoping someone can help. I tried combining the text into one column (i.e., "catmouse", "dogsquirrel", etc.) and then reversing the letters, but that quickly proved too complex.

Thanks so much for any help you can provide.

Upvotes: 0

Answers (2)

Yuriy Saraykin

Reputation: 8880

tidyverse

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"))
library(tidyverse)

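df %>% 
  rowwise() %>% 
  # Build an order-independent key from columns 1 and 3 (the two animals);
  # the "|" separator keeps pairs like "ab" + "c" and "a" + "bc" distinct.
  mutate(duplicates = str_c(sort(c_across(c(1, 3))), collapse = "|")) %>% 
  group_by(duplicates) %>% 
  # Replace the key with a flag: TRUE when the pair occurs more than once.
  mutate(duplicates = n() > 1) %>% 
  ungroup()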
#> # A tibble: 4 x 4
#>   animal_1 predation_type animal_2 duplicates
#>   <chr>    <chr>          <chr>    <lgl>     
#> 1 cat      eats           mouse    TRUE      
#> 2 dog      eats           squirrel FALSE     
#> 3 mouse    eaten by       cat      TRUE      
#> 4 squirrel eats           nuts     FALSE

Created on 2022-01-17 by the reprex package (v2.0.1)

removing duplicates


library(tidyverse)
df %>% 
  # Drop rows whose unordered animal pair has already appeared.
  filter(!duplicated(map2_chr(animal_1, animal_2, ~ str_c(sort(c(.x, .y)), collapse = "|"))))
#>   animal_1 predation_type animal_2
#> 1      cat           eats    mouse
#> 2      dog           eats squirrel
#> 3 squirrel           eats     nuts

Created on 2022-01-17 by the reprex package (v2.0.1)
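
Note that the sorted key above looks only at the two animal columns and ignores predation_type, so a row such as "mouse eats cat" would also be treated as a duplicate of "cat eats mouse". If direction matters, one option is to rewrite every row as "predator eats prey" before grouping. The following is a minimal sketch, not a definitive implementation: it assumes predation_type only ever takes the values "eats" and "eaten by", and the column names predator, prey and the object name flagged are made up for illustration.

```r
library(dplyr)

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"),
                 stringsAsFactors = FALSE)

# Assumption: predation_type is either "eats" or "eaten by", nothing else.
flagged <- df %>%
  mutate(
    # Rewrite every row as "predator eats prey", whatever its original direction.
    predator = if_else(predation_type == "eaten by", animal_2, animal_1),
    prey     = if_else(predation_type == "eaten by", animal_1, animal_2)
  ) %>%
  # Rows describing the same predator/prey pair land in the same group.
  group_by(predator, prey) %>%
  mutate(duplicates = n() > 1) %>%
  ungroup()
```

With the example data, rows 1 and 3 both become predator "cat" and prey "mouse", so they are flagged, whereas a reversed relationship would land in a group of its own.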

Upvotes: 1

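You can sort() the two animal columns within each row so that every pair appears in the same order, which makes duplicated() useful.

newdf = df[, c('animal_1', 'animal_2')]

# Sort only the two animal columns of each row, so a pair reads the same in either direction
for (i in seq_len(nrow(newdf))){
  newdf[i, ] = sort(unlist(newdf[i, ]))
}

# duplicated() on the whole data frame compares complete rows
newdf[!duplicated(newdf), ]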

  animal_1 animal_2
1      cat    mouse
2      dog squirrel
4     nuts squirrel
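
The same row-wise sorting idea can probably be done without a loop, since pmin() and pmax() also accept character vectors and compare them element by element. This is only a sketch in base R: the names first, second, key, is_dup and deduped are invented here, and the alphabetical ordering depends on the locale (which does not affect which rows count as duplicates).

```r
df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"),
                 stringsAsFactors = FALSE)

# pmin()/pmax() compare the two columns element by element, so each row's
# pair ends up in alphabetical order without an explicit loop.
first  <- pmin(df$animal_1, df$animal_2)
second <- pmax(df$animal_1, df$animal_2)
key <- data.frame(first, second)

# TRUE for every row whose unordered pair occurs more than once.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Keep only the first occurrence of each pair, with all original columns.
deduped <- df[!duplicated(key), ]
```

Indexing the original df with the key (rather than returning the sorted copy) keeps predation_type and the original column order in the result.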

Upvotes: 0
