Peter Verbeet

Reputation: 1816

Find which row duplicates which row in a data.frame

I have a data.frame like this:

data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33), byrow = TRUE, ncol = 3))

Now I want to know which row is a duplicate of which row. The result should be an index vector in which each duplicated row gets the index already assigned to the earliest identical row; if a row is not a duplicate of a previous row, it should get the next available index. In this example the output should be:

c(1, 2, 1, 1, 3, 4, 3)

I can achieve this by looping across all pairs of rows, but there must be an efficient way of doing this.

Unfortunately, duplicated() only shows which rows are duplicates, not which earlier row each one duplicates. Is there a function that could help here?

Upvotes: 2

Views: 143

Answers (4)

akrun

Reputation: 887851

Another option is .GRP from data.table, where df1 is your data.frame:

library(data.table)
setDT(df1)[, grp := .GRP , .(X1, X2, X3)]$grp
#[1] 1 2 1 1 3 4 3

Upvotes: 1

alexis_laz

Reputation: 13122

Another alternative is the grouping function, available in newer versions of R (dat is your data.frame).

Get the order of rows where identical values are placed next to each other:

grs = do.call(grouping, dat)

And manipulate the "attributes" of the result to get the wanted outcome:

ends = attr(grs, "ends")
rep(seq_along(ends), c(ends[1], diff(ends)))[order(grs)]
#[1] 1 2 1 1 3 4 3
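For a self-contained run, the two steps above can be wrapped into a small helper. This is only a sketch: the name first_seen_ids is made up for illustration, and the final match() step is an addition that renumbers the groups by first appearance, in case the group order produced by grouping() differs from the row order of the data.

```r
# Hypothetical helper (not part of base R): group ids numbered by first appearance.
first_seen_ids <- function(dat) {
  grs  <- do.call(grouping, dat)        # permutation placing identical rows next to each other
  ends <- attr(grs, "ends")             # last position of each group in that permutation
  id   <- rep(seq_along(ends), c(ends[1], diff(ends)))[order(grs)]
  match(id, unique(id))                 # renumber so the first group seen gets 1, the next 2, ...
}

dat <- data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33),
                         byrow = TRUE, ncol = 3))
res <- first_seen_ids(dat)
res
#[1] 1 2 1 1 3 4 3

# Same rows with the first two swapped: the ids still follow first appearance.
res_swapped <- first_seen_ids(dat[c(2, 1, 3:7), ])
res_swapped
#[1] 1 2 2 2 3 4 3
```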

Upvotes: 5

akuiper

Reputation: 215117

As an alternative, you can use group_indices from dplyr:

dplyr::group_indices(df, X1, X2, X3)
# [1] 1 2 1 1 3 4 3

Where X1, X2 and X3 are the column names of your data frame.
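In dplyr 1.0.0 and later, passing column names directly to group_indices is deprecated, so a sketch for current versions would group first and then ask for the indices. The data frame df is assumed to be the one from the question. The match() step is an addition, assuming you want indices numbered by first appearance rather than by the order of the group keys.

```r
library(dplyr)

# Assumed to be the data from the question.
df <- data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33),
                        byrow = TRUE, ncol = 3))

# Group by all three columns, then get one group index per row.
id <- df %>% group_by(X1, X2, X3) %>% group_indices()

# Renumber by first appearance (a no-op for this particular data).
id_first <- match(id, unique(id))
id_first
# [1] 1 2 1 1 3 4 3
```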

Upvotes: 3

Maurits Evers

Reputation: 50728

Is this what you're after?

# Your data
d <- data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33), byrow = TRUE, ncol = 3))

# Paste each row into a single key, then number the keys in order of first appearance
key <- apply(d, 1, paste, collapse = "_")
idx <- as.numeric(factor(key, levels = unique(key)))
print(idx)
#[1] 1 2 1 1 3 4 3
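The same idea can be written without apply() and factor(). The following is a sketch in base R, assuming the data from the question: do.call(paste, ...) builds one key per row column-wise, and match() against the unique keys gives the index of first appearance. It assumes the separator "_" does not occur in the values themselves, otherwise different rows could produce the same key.

```r
# Data from the question.
d <- data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33),
                       byrow = TRUE, ncol = 3))

# One key string per row, e.g. "11_12_13".
key <- do.call(paste, c(d, sep = "_"))

# Position of each key among the unique keys, in order of first appearance.
idx <- match(key, unique(key))
idx
#[1] 1 2 1 1 3 4 3
```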

Upvotes: 4
