Reputation: 33
I have a dataset with one ID column, 12 information columns (strings) and n rows. It looks like this:
ID Col1 Col2 Col3 Col4 Col5 ...
01 a b c d a
02 a a a a a
03 b b b b b
...
I need to go row by row and check if that row (considering all of it's columns) is equal to any other row in the dataset. My output needs to be two new columns: one indicating if that particular row is equal to any other row and a second column indicating which row it is equal to (in case of TRUE in the previous column)
I appreciate any suggestions.
Upvotes: 0
Views: 2342
Reputation: 269596
Assuming DF
in the Note at the end, sort it and create a column dup
indicating whether there exists a prior duplicate row. Then set to wx
to the row number in the original data frame of the duplicate. Finaly resort back.
We have assumed that duplicate means that the columns other than the ID are the same but that is readily changed if need be. We have also assumed that we should mark the second and subsequent rows among duplicates whereas the first is not so marked becaue it has to that point no duplicate.
The question does not address the situation of more than 2 identical rows but if that situation exists then each duplicate will point to the nearest prior row of which it is a duplicate.
o <- do.call("order", DF[-1])
DFo <- DF[o, ]
DFo$wx <- DFo$dup <- duplicated(DFo)
DFo$wx[DFo$dup] <- as.numeric(rownames(DFo))[which(DFo$dup) - 1]
DFo[order(o), ] # back to original order
giving:
ID Col1 Col2 Col3 Col4 Col5 dup wx
1 1 a b c d a FALSE 0
2 2 a a a a a FALSE 0
3 3 b b b b b FALSE 0
4 1 a b c d a TRUE 1
Lines <- "ID Col1 Col2 Col3 Col4 Col5
01 a b c d a
02 a a a a a
03 b b b b b"
DF <- read.table(text = Lines, header = TRUE)
DF <- DF[c(1:3, 1), ]
rownames(DF) <- NULL
giving:
> DF
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 1 a b c d a
Upvotes: 2
Reputation: 8844
A dplyr
solution
library(dplyr)
df %>%
mutate(row_num = 1:n(), is_dup = duplicated(df)) %>%
group_by(across(-c(row_num, is_dup))) %>%
mutate(
has_copies = n() > 1L,
which_row = if_else(is_dup, first(row_num), NA_integer_),
row_num = NULL, is_dup = NULL
)
Output
# A tibble: 5 x 8
# Groups: ID, Col1, Col2, Col3, Col4, Col5 [3]
ID Col1 Col2 Col3 Col4 Col5 has_copies which_row
<chr> <fct> <fct> <fct> <fct> <fct> <lgl> <int>
1 1 a b c d a FALSE NA
2 2 a a a a a FALSE NA
3 3 b b b b b TRUE NA
4 3 b b b b b TRUE 3
5 3 b b b b b TRUE 3
For each row that has more than one copies, the has_copies
gives a TRUE
.
For a set of rows that are the same, I consider the first one as the original and all other rows as duplicates. In this regard, which_row
gives you the index of the original for each duplicate it found. In other words, If a row has no duplicate or is the original, it gives you NA
.
Upvotes: 1
Reputation: 14764
With a df
like below:
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 3 b b b b b
You could try grouping by all columns and checking whether any count > 1
as well as pasting together row numbers (1:nrow(df)
):
df <- transform(
df,
dupe = ave(ID, mget(names(df)), FUN = length) > 1,
dupeRows = ave(1:nrow(df), mget(names(df)), FUN = toString)
)
As this would get you a number for each row, even when there are no duplicates, you could do:
df$dupeRows <- with(df,
Map(function(x, y)
toString(x[x != y]),
strsplit(as.character(dupeRows), split = ', '),
1:nrow(df)))
Output:
ID Col1 Col2 Col3 Col4 Col5 dupe dupeRows
1 1 a b c d a FALSE
2 2 a a a a a FALSE
3 3 b b b b b TRUE 4
4 3 b b b b b TRUE 3
Data
df <- structure(list(ID = c(1L, 2L, 3L, 3L), Col1 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col2 = structure(c(2L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col3 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "c"), class = "factor"), Col4 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "d"), class = "factor"), Col5 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor")), row.names = c(NA,
-4L), class = "data.frame")
Upvotes: 1