Compare two columns to return equal values in both

I have four ID columns in my table (called MGIT): Ext_ID_1 through _4. Sometimes the same number is present in different rows in them and I need to filter these rows for further analysis, example:

Starting state:

Ext_ID      Ext_ID_4

1111          2222
3333          4444
5555          1111
6666          7777
8888          9999
9999          1010

Desired filtered outcome:

Ext_ID      Ext_ID_4

1111          2222
5555          1111
8888          9999
9999          1010

I would only like to compare them two at a time for clearer management.

Through previous questions in StackOverflow I have found the following code:

Dupl = MGIT[,c('Ext_ID','Ext_ID_4')]
Result <- MGIT[duplicated(Dupl) | duplicated(Dupl, fromLast=TRUE),]

But it's giving back duplicated only values inside the columns (actually, only inside Ext_ID, but I believe ID_4 doesn't have duplicated values within itself).

I am a beginner at programming and know nothing about the language.

Upvotes: 3

Views: 819

Answers (2)

jpiversen
jpiversen

Reputation: 3212

I assume you want this to work for all your four columns, and not just the two in your example.

Here is a solution that will work no matter how many "Ext_ID" columns you have:

library(dplyr)

# Recreate the data from your example
df <- tibble::tribble(
  ~Ext_ID,      ~Ext_ID_4,
  1111,          2222,
  3333,          4444,
  5555,          1111,
  6666,          7777,
  8888,          9999,
  9999,          1010
)

# The actual code you need - just replace `df` with the name of your table
df %>% 
  filter(
    if_any(
      starts_with("Ext_ID"),
      ~ .x %>% purrr::map_lgl(~sum(df == .x) > 1)
    )
  )

# The output:
#> # A tibble: 4 x 2
#>   Ext_ID Ext_ID_4
#>    <dbl>    <dbl>
#> 1   1111     2222
#> 2   5555     1111
#> 3   8888     9999
#> 4   9999     1010

Explaination:

  • ~ .x %>% purrr::map_lgl(~sum(df == .x) > 1) checks whether each value in a column shows up more than once in the dataframe.
  • starts_with("Ext_ID") makes sure that this is done for all columns starting with "Ext_ID"
  • if_any() makes filter() keep rows where there are at least 1 duplicate value in the df for any of the columns.

Upvotes: 1

Ma&#235;l
Ma&#235;l

Reputation: 52004

A base R solution:

#Select whatever column you want
cols <- unlist(dat[, c('Ext_ID','Ext_ID_4')]) 

#Get the values that appear at least twice 
uni <- unique(cols[ave(cols, cols, FUN=length) > 1]) 

#Filter if any of the row values match with uni
dat[apply(apply(dat, 1, function(x) x %in% uni), 2, any), ] 
  Ext_ID Ext_ID_4
1   1111     2222
3   5555     1111
5   8888     9999
6   9999     1010

Upvotes: 0

Related Questions