Reputation: 3181
I'm using the cluster command and am running into difficulties due to insufficient memory. To get around this problem I would like to delete all duplicate observations.
I would like to cluster on the variables A, B and C, and I identify duplicates like so:
/* Create dummy data */
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end
sort A B C id
duplicates tag A B C, gen(dup_tag)
I would like to add a variable dup_id which tells me that ids 2 and 3 are duplicates of id 1, ids 5, 6 and 7 of id 4, and so on. How could I do this?
/* Desired result */
id A B C dup_id
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 2 2 2 4
5 2 2 2 4
6 2 2 2 4
7 2 2 2 4
8 3 3 3 8
9 3 3 3 8
10 4 4 4 10
Upvotes: 0
Views: 2470
Reputation: 37183
duplicates is a wonderful command (see its manual entry for why I say that), but you can do this directly:
bysort A B C : gen tag = _n == 1
tags the first occurrence of duplicates of A B C as 1 and all others as 0. For the other way round use _n > 1, _n != 1, or whatever.
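As a minimal sketch of how such a tag can be used, the lines below count the distinct combinations of A B C and the surplus observations in the question's dummy data. The variable name is_extra is only an illustrative choice, not something from the question.

```stata
clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end

* 1 for the first observation of each A B C group, 0 otherwise
bysort A B C: gen tag = _n == 1

* The other way round: 1 for every observation after the first in its group
* (is_extra is a hypothetical name used for illustration)
bysort A B C: gen is_extra = _n > 1

* Number of distinct A B C combinations
count if tag

* Number of observations that duplicate an earlier one
count if is_extra
```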
EDIT:
So then the id of tagged observations (assuming the data are still sorted by id within A B C, as in the question) is just
by A B C: gen dup_id = id[1]
For basic technique with by:, see (e.g.) this discussion.
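Since the stated goal is to delete the duplicates before clustering, here is a hedged sketch of one way to combine the two steps on the question's dummy data. It puts id in parentheses so that the order within each group is explicit, which is an assumption about what is wanted (the lowest id as the representative).

```stata
clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end

* Record the lowest id in each A B C group
bysort A B C (id): gen dup_id = id[1]

* Keep only the first observation of each group, dropping the duplicates
bysort A B C (id): keep if _n == 1

list, noobs
```

If the mapping from every original id to its dup_id is needed later, one option is to save the data to a separate file before the keep step.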
Upvotes: 2
Reputation: 4011
You can refer to the first observation in each group of A B C using the subscript [1] on id. Note the (id) argument in bysort, which sorts by id, but identifies the groups by A, B, and C only.
clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end
bysort A B C (id): gen dup_id = id[1]
li, noobs sepby(dup_id)
yielding
+-------------------------+
| id   A   B   C   dup_id |
|-------------------------|
|  1   1   1   1        1 |
|  2   1   1   1        1 |
|  3   1   1   1        1 |
|-------------------------|
|  4   2   2   2        4 |
|  5   2   2   2        4 |
|  6   2   2   2        4 |
|  7   2   2   2        4 |
|-------------------------|
|  8   3   3   3        8 |
|  9   3   3   3        8 |
|-------------------------|
| 10   4   4   4       10 |
+-------------------------+
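As a cross-check, the same result can be reached with egen, since the first id after sorting on id within a group is simply the group minimum. This is only a sketch, and dup_id_check is a hypothetical variable name chosen for illustration.

```stata
clear
input id A B C
1 1 1 1
2 1 1 1
3 1 1 1
4 2 2 2
5 2 2 2
6 2 2 2
7 2 2 2
8 3 3 3
9 3 3 3
10 4 4 4
end

* First id within each A B C group, with the data sorted by id inside groups
bysort A B C (id): gen dup_id = id[1]

* Alternative: the minimum id within each A B C group
* (dup_id_check is a hypothetical name used for illustration)
egen dup_id_check = min(id), by(A B C)

* The two approaches should agree on every observation
assert dup_id == dup_id_check
```

The egen version does not depend on sort order, which may make it the safer choice if the data get re-sorted between steps.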
Upvotes: 1