I have a long data set similar to the following:
|patient_id |group_number|
|-----------|------------|
|1          |3           |
|1          |5           |
|2          |5           |
|2          |4           |
|3          |3           |
Pretend that there are many more observations and many more unique group numbers. I am trying to drop all observations whose group_number has fewer than 50 occurrences in the data set. I feel like this would involve creating a list of the group_numbers that have at least 50 occurrences (perhaps in a numlist), then dropping a row if its group_number is not in the numlist. My issue, however, is creating this numlist.
So far, I have tried using tab to get a list of sorted frequencies, and then working with those values:
tab group_number, sort matcell(x)
svmat x
list x if x > 50 & x != .
This gets me a list of the frequencies of the values that have more than fifty occurrences. It escapes me, however, how to translate this list into dropping rows. Am I on the right track, or is there a better method?
I could, of course, accomplish this with something like
drop if group_number == 3 | group_number == 4 | group_number == 5
and continue listing every group number with fewer than 50 occurrences. Unfortunately, this is not feasible given the size of my data set.
Upvotes: 0
Views: 2455
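Here's code that drops all observations for groups with fewer than 2 observations. Within a by-group, _N is the number of observations in that group, so for your data you would use group_number in place of grp and 50 in place of 2.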
clear
input pid grp
1 3
1 5
2 5
2 4
3 3
end
bysort grp: drop if _N<2
list, clean noobs
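As a variation on the same idea, here is a minimal sketch using the question's variable name and threshold of 50. The toy data and the helper variable n_in_group are hypothetical and exist only for illustration. Storing the group size in a variable first lets you inspect the counts before dropping anything.

```stata
clear
set obs 120
* Hypothetical toy data: group 1 has 60 rows, group 2 has 45, group 3 has 15
generate group_number = cond(_n <= 60, 1, cond(_n <= 105, 2, 3))

* Store each group's size in a helper variable (hypothetical name)
bysort group_number: generate n_in_group = _N

* Drop every observation that belongs to a group with fewer than 50 rows
drop if n_in_group < 50
drop n_in_group

tabulate group_number
```

If you do not need to look at the counts, the one-line form from the code above (bysort with drop if _N < 50) does the same thing without the helper variable.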
Upvotes: 1