Reputation: 37

Sequence number for duplicate rows in R

I have a data frame with numeric and character columns in which some rows are duplicated. To tell those rows apart, I want to add a new column (called "duplicateID" in my example) that numbers the rows within each "block" of duplicate rows from 1 to n.

My dataset looks like this:

a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <-data.frame(a,b)

>df1
       a   b
1    one 3.5
2    one 3.5
3    one 3.5
4    one 2.5
5    two 3.5
6    two 3.5
7  three 1.0
8   four 2.2
9   four 7.0
10  four 7.0

Desired output is:

a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
duplicateID = c(1, 2, 3, 1, 1, 2, 1, 1, 1, 2)
df2 <-data.frame(a,b,duplicateID)

>df2 
       a   b duplicateID
1    one 3.5           1
2    one 3.5           2
3    one 3.5           3
4    one 2.5           1
5    two 3.5           1
6    two 3.5           2
7  three 1.0           1
8   four 2.2           1
9   four 7.0           1
10  four 7.0           2

Thank you all in advance!

Upvotes: 1

Answers (3)

akrun

Reputation: 887981

We could use rowid() from data.table, grouping by both columns:

library(data.table)
setDT(df1)[, dupID := rowid(a, b)]

-output

> df1
        a   b dupID
 1:   one 3.5     1
 2:   one 3.5     2
 3:   one 3.5     3
 4:   one 2.5     1
 5:   two 3.5     1
 6:   two 3.5     2
 7: three 1.0     1
 8:  four 2.2     1
 9:  four 7.0     1
10:  four 7.0     2
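
Note that setDT() converts df1 to a data.table in place. If the original data frame should stay a plain data.frame, one option is to work on a copy. This is only a minimal sketch, assuming the sample data from the question; the name dt is purely illustrative:

```r
library(data.table)

# Sample data from the question
a <- c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b <- c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <- data.frame(a, b)

# as.data.table() returns a copy, so df1 itself remains a plain data.frame
dt <- as.data.table(df1)

# rowid() numbers the rows within each unique combination of a and b
dt[, dupID := rowid(a, b)]
print(dt)
```

The grouping logic is the same as above; only the conversion step differs.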

Upvotes: 4

DPH

Reputation: 4354

One way to achieve this with dplyr:

library(dplyr)

df1 %>% 
    # build grouping by combination of variables
    dplyr::group_by(a, b) %>%
    # add row number which works per group due to prior grouping
    dplyr::mutate(duplicateID = dplyr::row_number()) %>%
    # ungroup to prevent unexpected behaviour down stream
    dplyr::ungroup()

# A tibble: 10 x 3
   a         b duplicateID
   <chr> <dbl>       <int>
 1 one     3.5           1
 2 one     3.5           2
 3 one     3.5           3
 4 one     2.5           1
 5 two     3.5           1
 6 two     3.5           2
 7 three   1             1
 8 four    2.2           1
 9 four    7             1
10 four    7             2
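
If a recent dplyr is available (1.1.0 or later, which is an assumption about your setup), the same idea can likely be written more compactly with the .by argument of mutate(), which groups for that single call only. A sketch using the sample data from the question:

```r
library(dplyr)

# Sample data from the question
a <- c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b <- c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <- data.frame(a, b)

# .by groups only for this mutate() call, so no ungroup() is needed
# (assumes dplyr >= 1.1.0)
df2 <- df1 %>%
    mutate(duplicateID = row_number(), .by = c(a, b))
df2
```

On older dplyr versions the group_by() / ungroup() form is the one to use.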

Upvotes: 5

Euan Ritchie

Reputation: 362

This might not be as fast as dplyr (and data.table surely has options too), but in base R you can achieve it by combining ave() with seq_along():

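# Sample data from the question
a <- c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b <- c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <- data.frame(a, b)
# Placeholder column; ave() only needs a vector of the right length
df1$dupID <- NA
# Within each combination of b and a, replace the placeholder with 1, 2, ..., n
df1$dupID <- with(df1, ave(dupID, b, a, FUN = seq_along))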
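
A slightly shorter variant skips the NA placeholder column by handing ave() a row index directly. This is a sketch assuming the sample data from the question; it also compares the result with the duplicateID values the question asks for:

```r
# Sample data from the question
a <- c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b <- c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <- data.frame(a, b)

# ave() splits the row index by each (a, b) combination, and
# seq_along() restarts the counter at 1 inside every group
df1$dupID <- with(df1, ave(seq_along(a), a, b, FUN = seq_along))

# Compare with the duplicateID column given in the question
expected <- c(1, 2, 3, 1, 1, 2, 1, 1, 1, 2)
stopifnot(all(df1$dupID == expected))
```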

Upvotes: 2
