Tag duplicates with first occurrence ID

Question

I'm using the cluster command and am running into problems because of insufficient memory. To get around this, I would like to delete all duplicate observations.

I would like to cluster via the variables A, B and C and I identify duplicate values as so:

/* Create dummy data */
clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end

sort A B C id

duplicates tag A B C, gen(dup_tag)

I would like to add a variable dup_id which tells me that ids 2 and 3 are duplicates of id 1, ids 5, 6 and 7 are duplicates of id 4, and so on. How could I do this?

/* Desired result */

id A  B  C  dup_id
1  1  1  1  1
2  1  1  1  1
3  1  1  1  1
4  2  2  2  4
5  2  2  2  4
6  2  2  2  4
7  2  2  2  4
8  3  3  3  8
9  3  3  3  8
10 4  4  4  10

Brendan · Accepted Answer

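You can refer to the first observation in each group of A B C by using the subscript [1] on id. Note the (id) argument in bysort: putting id in parentheses sorts the observations by id within each group, but the groups themselves are identified by A, B, and C only. As a result, id[1] is the smallest id in each group.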

clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2 
8 3 3 3
9 3 3 3
10 4 4 4
end

bysort A B C (id): gen dup_id = id[1]
li, noobs sepby(dup_id)

yielding

  +-------------------------+
  | id   A   B   C   dup_id |
  |-------------------------|
  |  1   1   1   1        1 |
  |  2   1   1   1        1 |
  |  3   1   1   1        1 |
  |-------------------------|
  |  4   2   2   2        4 |
  |  5   2   2   2        4 |
  |  6   2   2   2        4 |
  |  7   2   2   2        4 |
  |-------------------------|
  |  8   3   3   3        8 |
  |  9   3   3   3        8 |
  |-------------------------|
  | 10   4   4   4       10 |
  +-------------------------+
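
Since the goal in the question is to delete the duplicates before clustering, here is a minimal sketch of one way to do that after tagging. It assumes the same dummy data as above. The variable n_dups is a hypothetical addition, not part of the original answer; it records how many observations each retained row stands for, in case that count is needed later.

```stata
clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end

* Tag every observation with the id of the first occurrence in its A B C group
bysort A B C (id): gen dup_id = id[1]

* Hypothetical helper: number of observations in each A B C group
bysort A B C (id): gen n_dups = _N

* Keep only the first observation of each group, dropping the duplicates
bysort A B C (id): keep if _n == 1

list, noobs
```

With the dummy data this leaves four observations (ids 1, 4, 8 and 10), each with dup_id equal to its own id. If the tagging is not needed at all, duplicates drop A B C, force should also reduce the data to one observation per group, though it does not record which id was kept for each dropped row.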
