Sharif
Sharif

Reputation: 391

filtering using the common values between two columns using dplyr

I have a dataset like this -

tibble(
  year = c (2005, 2005, 2005,
            2006, 2006, 2006,
            2007,2007,2007),
  var1 = c (2136,2136,2136,
            2285,2285,2285,
            2968,2968,2968),
  var2 = c (125860,1318,127464,
            8972,2136,5046,
            4739,1619,10063)
)

I want to filter the dataset like this - first group by the data by variables year and var1, then compare columns var1 and var2, if it is found that the values in columns var1 and var2 match, then I want to filter all of them.

For the example dataset, if we look, we can see in var1 2136 in 2005 matches with 2136 in var2 in 2006. therefore, I would like to filter both 2005 and 2006 rows because in both columns var1 and var2, they have a common value - 2136.

So the expected resulted will look like this -

tibble(
  year = c (2005, 2005, 2005,
            2006, 2006, 2006
            ),
  var1 = c (2136,2136,2136,
            2285,2285,2285),
  var2 = c (125860,1318,127464,
            8972,2136,5046)
)

Can anybody help?

Upvotes: 3

Views: 763

Answers (3)

Anoushiravan R
Anoushiravan R

Reputation: 21908

This may sound verbose but I think it's quite reliable:

library(dplyr)
library(purrr)

df %>%
  filter(year %in% 
           (combn(unique(df$year), 2) %>%
           t() %>% 
           as.data.frame() %>%
           mutate(common = pmap_lgl(., ~ length(intersect(df$var1[df$year == ..1], df$var2[df$year == ..2])) > 0)) %>%
           filter(common) %>%
           select(-common) %>%
           unlist() %>%
           unique()))

# A tibble: 6 x 3
   year  var1   var2
  <dbl> <dbl>  <dbl>
1  2005  2136 125860
2  2005  2136   1318
3  2005  2136 127464
4  2006  2285   8972
5  2006  2285   2136
6  2006  2285   5046

Upvotes: 1

AnilGoyal
AnilGoyal

Reputation: 26218

Rather easy in tidyverse. I am using purrr::map2_lgl() apart from usual dplyr functions here.

library(tidyverse)
df <- tibble(
  year = c (2005, 2005, 2005,
            2006, 2006, 2006,
            2007,2007,2007),
  var1 = c (2136,2136,2136,
            2285,2285,2285,
            2968,2968,2968),
  var2 = c (125860,1318,127464,
            8972,2136,5046,
            4739,1619,10063)
)

df %>% 
  mutate(xx = map2_lgl(var1, var2, ~ .x %in% var2 | .y %in% var1)) %>%
  group_by(year, var1) %>%
  filter(any(xx))
#> # A tibble: 6 x 4
#> # Groups:   year, var1 [2]
#>    year  var1   var2 xx   
#>   <dbl> <dbl>  <dbl> <lgl>
#> 1  2005  2136 125860 TRUE 
#> 2  2005  2136   1318 TRUE 
#> 3  2005  2136 127464 TRUE 
#> 4  2006  2285   8972 FALSE
#> 5  2006  2285   2136 TRUE 
#> 6  2006  2285   5046 FALSE

Created on 2021-06-23 by the reprex package (v2.0.0)

You may remove xx after ungroup, if you want. I kept this for explaining things.

Upvotes: 1

r2evans
r2evans

Reputation: 160427

A non-dplyr suggestion:

uniqyears <- unique(df1$year)
keepyears <- sapply(uniqyears,
                    function(yr) any(df1$var1[df1$year == yr] %in% df1$var2[df1$year != yr] |
                      df1$var2[df1$year == yr] %in% df1$var1[df1$year != yr]))
keepyears
# [1]  TRUE  TRUE FALSE
filter(df1, year %in% uniqyears[keepyears])
# # A tibble: 6 x 3
#    year  var1   var2
#   <dbl> <dbl>  <dbl>
# 1  2005  2136 125860
# 2  2005  2136   1318
# 3  2005  2136 127464
# 4  2006  2285   8972
# 5  2006  2285   2136
# 6  2006  2285   5046

Upvotes: 0

Related Questions