Keeping only the most frequent values of a variable in Stata

Question

I have a long data set similar to the following:

|patient_id |group_number|
|-----------|------------|
|1          |3           |
|1          |5           |
|2          |5           |
|2          |4           |
|3          |3           |

Pretend that there are many more observations and many more unique group numbers. I am trying to drop all observations where the group_number has fewer than 50 occurrences in the data set. I feel like this would involve creating a list of the group_numbers that have more than 50 occurrences (perhaps in a numlist), then dropping the row if the group_number is not in the numlist. My issue, however, is creating this numlist.

So far, I have tried using tab to get a list of sorted frequencies, and then working with those values:

tab group_number, sort matcell(x) 
svmat x
list x if x > 50 & x != .

This gets me a list of the frequencies of the values that have more than fifty occurrences. It escapes me, however, how to translate this list into dropping rows. Am I on the right track, or is there a better method?

I could, of course, accomplish this with something like

drop if group_number == 3 | group_number == 4 | group_number == 5

continuing the list until it covers every group number with fewer than 50 occurrences. Unfortunately, this is not feasible given the size of my data set.

user4690969 · Accepted Answer

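Here's code that drops all observations in groups with fewer than 2 observations. Under bysort, _N is the number of observations in the current by-group, so the same line works for any threshold.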

clear
input pid grp
1          3 
1          5 
2          5 
2          4 
3          3 
end
bysort grp: drop if _N<2
list, clean noobs
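
Applied to the question's own variable, the same idea might look like the sketch below. It assumes the variable is named group_number as in the question's table and that the cutoff is "fewer than 50 occurrences"; the helper variable n_in_group is a hypothetical name introduced only for illustration.

```stata
* Sketch only: assumes the data in memory contain group_number
* n_in_group is a hypothetical helper variable, not from the original post

* Store each group's size so it can be inspected before dropping anything
bysort group_number: generate long n_in_group = _N

* Optional check: one row per group with its frequency
tabulate group_number

* Drop every observation whose group occurs fewer than 50 times
drop if n_in_group < 50

* Remove the helper variable once it is no longer needed
drop n_in_group
```

If the group sizes do not need to be inspected first, the single line bysort group_number: drop if _N < 50 should have the same effect without creating a helper variable.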
