Reputation: 1816
I have a data.frame like this:
data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33), byrow = TRUE, ncol = 3))
Now I want to know which row is a duplicate of which row, returning an index vector in which every duplicated row shares the index of its first occurrence. If a row is not a duplicate of a previous row, it should get the next available index. In this example the output should be:
c(1, 2, 1, 1, 3, 4, 3)
I can achieve this by looping across all pairs of rows, but there must be an efficient way of doing this.
Unfortunately, duplicated only shows which rows are duplicates, but not WHICH row they duplicate exactly. Is there a function that could help here?
Upvotes: 2
Views: 143
Reputation: 887851
Another option is .GRP from data.table:
library(data.table)
setDT(df1)[, grp := .GRP , .(X1, X2, X3)]$grp
#[1] 1 2 1 1 3 4 3
Upvotes: 1
Reputation: 13122
Another alternative is the grouping function, available in newer versions of R.
Get the order of rows where identical values are placed next to each other:
grs = do.call(grouping, dat)
And manipulate the "attributes" of the result to get the wanted outcome:
ends = attr(grs, "ends")
rep(seq_along(ends), c(ends[1], diff(ends)))[order(grs)]
#[1] 1 2 1 1 3 4 3
Upvotes: 5
Reputation: 215117
As an alternative, you can use group_indices from dplyr:
dplyr::group_indices(df, X1, X2, X3)
# [1] 1 2 1 1 3 4 3
Where X1, X2 and X3 are the column names of your data frame.
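One caveat worth checking: dplyr sorts its groups by key, so the indices follow sorted order rather than order of first appearance. The two happen to coincide for the example data. If the "next available index" numbering from the question matters, a minimal sketch of renumbering is below; the vector g is a hypothetical stand-in for group ids that came back in sorted-key order.

```r
# Hypothetical group ids, numbered by sorted key rather than by first appearance
g <- c(3L, 1L, 3L, 2L, 1L)

# match() against the unique values renumbers groups in order of first appearance
first_seen <- match(g, unique(g))
print(first_seen)
#[1] 1 2 1 3 2
```

The same match(x, unique(x)) step should work on any vector of group labels, not only integers.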
Upvotes: 3
Reputation: 50728
Is this what you're after?
# Your data
d <- data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33), byrow = TRUE, ncol = 3))
# Build one string key per row, then number the keys in order of first appearance
keys <- apply(d, 1, paste, collapse = "_")
idx <- as.numeric(factor(keys, levels = unique(keys)))
print(idx)
#[1] 1 2 1 1 3 4 3
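The same paste-based idea can be written without apply, which may be faster on larger data frames because the keys are built column-wise. This is only a sketch: the names key and idx2 are illustrative, and the separator "\r" is an assumption that this character does not occur in the data.

```r
# Example data from the question
d <- data.frame(matrix(c(11:13, 21:23, 11:13, 11:13, 31:33, 41:43, 31:33),
                       byrow = TRUE, ncol = 3))

# Paste the columns together element-wise to get one key per row
key <- do.call(paste, c(d, sep = "\r"))

# Position of each key among the unique keys = index by first appearance
idx2 <- match(key, unique(key))
print(idx2)
#[1] 1 2 1 1 3 4 3
```

Here match returns, for each row, the position of its key among the distinct keys, so a row that repeats an earlier one receives that earlier row's index.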
Upvotes: 4