user2568648

Reputation: 3181

Tag duplicates with first occurrence ID

I'm using the cluster command and running into problems because of insufficient memory. To get around this, I would like to delete all duplicate observations.

I would like to cluster on the variables A, B and C, and I identify duplicate values like so:

/* Create dummy data */
clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end

sort A B C id

duplicates tag A B C, gen(dup_tag)

I would like to add a variable dup_id which tells me that ids 2 and 3 are duplicates of id 1, ids 5, 6 and 7 of id 4, and so on. How could I do this?

/* Desired result */

id A  B  C  dup_id
1  1  1  1  1
2  1  1  1  1
3  1  1  1  1
4  2  2  2  4
5  2  2  2  4
6  2  2  2  4
7  2  2  2  4
8  3  3  3  8
9  3  3  3  8
10 4  4  4  10

Upvotes: 0

Views: 2470

Answers (2)

Nick Cox

Reputation: 37183

duplicates is a wonderful command (see its manual entry for why I say that), but you can do this directly:

bysort A B C : gen tag = _n == 1

tags the first occurrence of each distinct combination of A B C as 1 and all others as 0. To tag the later occurrences instead, use _n > 1 or _n != 1.
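Since the stated goal is to delete duplicates, here is a minimal sketch of how such a tag might be used for that, assuming the variables id, A, B and C from the question. The variable name first is hypothetical, and (id) is added so that the observation kept is the one with the smallest id.

```stata
* Sketch only: keep the first observation in each A B C group
* (id) sorts by id within groups without making it part of the grouping
bysort A B C (id): gen byte first = _n == 1

* Keep tagged observations, then discard the helper variable
keep if first
drop first
```

The built-in duplicates drop A B C, force should give a similar result in one step, though it does not let you choose which observation in each group survives.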

EDIT:

So then the id of tagged observations is just

by A B C: gen dup_id = id[1] 

For basic technique with by: see (e.g.) this discussion

Upvotes: 2

Brendan

Reputation: 4011

You can refer to the first observation in each group of A B C using the subscript [1] on id. Note the (id) argument in bysort: it sorts by id within each group, but the groups themselves are defined by A, B, and C only.

clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2 
8 3 3 3
9 3 3 3
10 4 4 4
end

bysort A B C (id): gen dup_id = id[1]
li, noobs sepby(dup_id)

yielding

  +-------------------------+
  | id   A   B   C   dup_id |
  |-------------------------|
  |  1   1   1   1        1 |
  |  2   1   1   1        1 |
  |  3   1   1   1        1 |
  |-------------------------|
  |  4   2   2   2        4 |
  |  5   2   2   2        4 |
  |  6   2   2   2        4 |
  |  7   2   2   2        4 |
  |-------------------------|
  |  8   3   3   3        8 |
  |  9   3   3   3        8 |
  |-------------------------|
  | 10   4   4   4       10 |
  +-------------------------+
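If what you want is specifically the smallest id in each group, a possible alternative that does not depend on the sort order is egen's min() function. This is a different technique from the subscripting above, shown only as a sketch. The name dup_id2 is hypothetical, and the assert assumes dup_id has already been created as above.

```stata
* Sketch only: smallest id within each A B C group, whatever the current sort order
egen dup_id2 = min(id), by(A B C)

* Should agree with dup_id built via bysort A B C (id)
assert dup_id2 == dup_id
```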

Upvotes: 1
