Huskir
Huskir

Reputation: 33

Check if one row is equal to any other rows in R

I have a dataset with one ID column, 12 information columns (strings) and n rows. It looks like this:

ID Col1 Col2 Col3 Col4 Col5 ...
01 a    b    c    d    a
02 a    a    a    a    a
03 b    b    b    b    b
...

I need to go row by row and check if that row (considering all of it's columns) is equal to any other row in the dataset. My output needs to be two new columns: one indicating if that particular row is equal to any other row and a second column indicating which row it is equal to (in case of TRUE in the previous column)

I appreciate any suggestions.

Upvotes: 0

Views: 2342

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269596

Assuming DF in the Note at the end, sort it and create a column dup indicating whether there exists a prior duplicate row. Then set to wx to the row number in the original data frame of the duplicate. Finaly resort back.

We have assumed that duplicate means that the columns other than the ID are the same but that is readily changed if need be. We have also assumed that we should mark the second and subsequent rows among duplicates whereas the first is not so marked becaue it has to that point no duplicate.

The question does not address the situation of more than 2 identical rows but if that situation exists then each duplicate will point to the nearest prior row of which it is a duplicate.

o <- do.call("order", DF[-1])
DFo <- DF[o, ]
DFo$wx <- DFo$dup <- duplicated(DFo)
DFo$wx[DFo$dup] <- as.numeric(rownames(DFo))[which(DFo$dup) - 1]
DFo[order(o), ] # back to original order

giving:

  ID Col1 Col2 Col3 Col4 Col5   dup wx
1  1    a    b    c    d    a FALSE  0
2  2    a    a    a    a    a FALSE  0
3  3    b    b    b    b    b FALSE  0
4  1    a    b    c    d    a  TRUE  1

Note

Lines <- "ID Col1 Col2 Col3 Col4 Col5
01 a    b    c    d    a
02 a    a    a    a    a
03 b    b    b    b    b"
DF <- read.table(text = Lines, header = TRUE)
DF <- DF[c(1:3, 1), ]
rownames(DF) <- NULL

giving:

> DF
  ID Col1 Col2 Col3 Col4 Col5
1  1    a    b    c    d    a
2  2    a    a    a    a    a
3  3    b    b    b    b    b
4  1    a    b    c    d    a

Upvotes: 2

ekoam
ekoam

Reputation: 8844

A dplyr solution

library(dplyr)
df %>% 
  mutate(row_num = 1:n(), is_dup = duplicated(df)) %>% 
  group_by(across(-c(row_num, is_dup))) %>% 
  mutate(
    has_copies = n() > 1L, 
    which_row = if_else(is_dup, first(row_num), NA_integer_), 
    row_num = NULL, is_dup = NULL
  )

Output

# A tibble: 5 x 8
# Groups:   ID, Col1, Col2, Col3, Col4, Col5 [3]
  ID    Col1  Col2  Col3  Col4  Col5  has_copies which_row
  <chr> <fct> <fct> <fct> <fct> <fct> <lgl>          <int>
1 1     a     b     c     d     a     FALSE             NA
2 2     a     a     a     a     a     FALSE             NA
3 3     b     b     b     b     b     TRUE              NA
4 3     b     b     b     b     b     TRUE               3
5 3     b     b     b     b     b     TRUE               3
  1. For each row that has more than one copies, the has_copies gives a TRUE.

  2. For a set of rows that are the same, I consider the first one as the original and all other rows as duplicates. In this regard, which_row gives you the index of the original for each duplicate it found. In other words, If a row has no duplicate or is the original, it gives you NA.

Upvotes: 1

arg0naut91
arg0naut91

Reputation: 14764

With a df like below:

  ID Col1 Col2 Col3 Col4 Col5
1  1    a    b    c    d    a
2  2    a    a    a    a    a
3  3    b    b    b    b    b
4  3    b    b    b    b    b

You could try grouping by all columns and checking whether any count > 1 as well as pasting together row numbers (1:nrow(df)):

df <- transform(
  df,
  dupe = ave(ID, mget(names(df)), FUN = length) > 1,
  dupeRows = ave(1:nrow(df), mget(names(df)), FUN = toString)
)

As this would get you a number for each row, even when there are no duplicates, you could do:

df$dupeRows <- with(df, 
                    Map(function(x, y) 
                      toString(x[x != y]), 
                      strsplit(as.character(dupeRows), split = ', '), 
                      1:nrow(df)))

Output:

  ID Col1 Col2 Col3 Col4 Col5  dupe dupeRows
1  1    a    b    c    d    a FALSE         
2  2    a    a    a    a    a FALSE         
3  3    b    b    b    b    b  TRUE        4
4  3    b    b    b    b    b  TRUE        3

Data

df <- structure(list(ID = c(1L, 2L, 3L, 3L), Col1 = structure(c(1L, 
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col2 = structure(c(2L, 
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col3 = structure(c(3L, 
1L, 2L, 2L), .Label = c("a", "b", "c"), class = "factor"), Col4 = structure(c(3L, 
1L, 2L, 2L), .Label = c("a", "b", "d"), class = "factor"), Col5 = structure(c(1L, 
1L, 2L, 2L), .Label = c("a", "b"), class = "factor")), row.names = c(NA, 
-4L), class = "data.frame")

Upvotes: 1

Related Questions