Curious G.

Reputation: 877

Remove all duplicates based on different columns and rows in R

I have this condition:

    ID  N1      N2
    1   6387    6187 
    2   6290    6386 
    3   6295    6295 
    4   6043    6392 
    5   6042    6043
    6   6050    6031 
    7   6050    6036 
    8   6321    6434 
    9   6440    6397

with duplicates in different columns and rows, and I would like to remove ALL duplicates in N1 and N2, like this:

ID  N1      N2
1   6387    6187 
2   6290    6386 
8   6321    6434 
9   6440    6397

I tried these codes:

dt[(duplicated(dt[,2:3]) | duplicated(dt[,2:3], fromLast = TRUE)), ]

and

library(dplyr)
dt %>% distinct(N1,N2, .keep_all = TRUE)

But neither worked.
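For context, a minimal sketch of why both attempts come up empty, with the data frame (here called `dt`) rebuilt from the table above. `duplicated(dt[,2:3])` and `distinct(N1, N2)` both compare whole (N1, N2) *pairs* row by row, and every pair in this data is unique; the repeats only show up once N1 and N2 are pooled into a single vector:

```r
dt <- data.frame(
  ID = 1:9,
  N1 = c(6387, 6290, 6295, 6043, 6042, 6050, 6050, 6321, 6440),
  N2 = c(6187, 6386, 6295, 6392, 6043, 6031, 6036, 6434, 6397)
)

# Row-wise pair comparison finds nothing: no (N1, N2) pair repeats.
any(duplicated(dt[, 2:3]))
# [1] FALSE

# Pool both columns, collect the values that repeat anywhere...
vals <- c(dt$N1, dt$N2)
dups <- unique(vals[duplicated(vals)])

# ...and drop every row touching one of those values.
dt[!(dt$N1 %in% dups | dt$N2 %in% dups), ]
#   ID   N1   N2
# 1  1 6387 6187
# 2  2 6290 6386
# 8  8 6321 6434
# 9  9 6440 6397
```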

Upvotes: 0

Views: 90

Answers (5)

user10917479


A simple way to filter using dplyr, where each line builds up the logic:

  • Remove N1 duplicates (forward)
  • Remove N2 duplicates (forward)
  • Remove N1 duplicates (backwards)
  • Remove N2 duplicates (backwards)
  • Remove where N1 is in N2
  • Remove where N2 is in N1

library(dplyr)

df %>%
  filter(!duplicated(N1),
         !duplicated(N2),
         !duplicated(N1, fromLast = TRUE),
         !duplicated(N2, fromLast = TRUE),
         !(N1 %in% N2),
         !(N2 %in% N1))

Upvotes: 1

IceCreamToucan

Reputation: 28705

You can turn the contents of your data frame into a single vector, then check duplicated() both forward and backward. Next, reshape these results into the same dimensions as your data and keep the rows where the rowSums of this new dupes matrix is 0, i.e. no column is a dupe. (It's not clear whether you want to include ID in your dupe checks, but you can change unlist(df) to unlist(df[some_columns]) and it will work the same.)

df_vec <- unlist(df)
dupes <- duplicated(df_vec) | duplicated(df_vec, fromLast = TRUE)
df[rowSums(matrix(dupes, nrow(df))) == 0,]

#   ID   N1   N2
# 1  1 6387 6187
# 2  2 6290 6386
# 8  8 6321 6434
# 9  9 6440 6397

Upvotes: 3

Allan Cameron

Reputation: 174476

Your own method only needed to be modified a little: concatenate the two columns into one vector instead of passing them as a two-column data frame.

nondupes <- !duplicated(c(dt[,2], dt[,3])) & 
            !duplicated(c(dt[,2], dt[,3]), fromLast = TRUE)

dt[nondupes[1:nrow(dt)] & nondupes[nrow(dt) + 1:nrow(dt)],]

Upvotes: 1

haci

Reputation: 260

Here is a base R solution (data.table is only loaded to fread() the data):

library(data.table)

my_data <- fread("
ID  N1      N2
1   6387    6187 
2   6290    6386 
3   6295    6295 
4   6043    6392 
5   6042    6043
6   6050    6031 
7   6050    6036 
8   6321    6434 
9   6440    6397
")

# duplicated entries
dup <- c(my_data$N1, my_data$N2)[duplicated(c(my_data$N1, my_data$N2))]

# keep rows that do not bear any of the duplicated entries
my_data[!(my_data$N1 %in% dup | my_data$N2 %in% dup),]

   ID   N1   N2
1:  1 6387 6187
2:  2 6290 6386
3:  8 6321 6434
4:  9 6440 6397

Upvotes: 1

tmfmnk

Reputation: 40171

One dplyr and tidyr option could be:

library(dplyr)
library(tidyr)
library(tibble) # for rowid_to_column()

df %>%
 rowid_to_column() %>%
 pivot_longer(-c(ID, rowid)) %>%
 group_by(value) %>%
 filter(n() == 1) %>%
 group_by(rowid) %>%
 filter(n() == 2) %>%
 ungroup() %>%
 pivot_wider(names_from = name, values_from = value) %>%
 select(-rowid)

     ID    N1    N2
  <int> <int> <int>
1     1  6387  6187
2     2  6290  6386
3     8  6321  6434
4     9  6440  6397

Upvotes: 1
