Counting occurrences of values of string variable using collapse

In a dataset like the one below,

clear
input patid str2 dx 
1   qw
1   qe
1   qw
2   qw
2   qw
2   qs
2   qs
3   qe
3   qe
3   qs
3   qw
3   qw
3   qw
3   qs
4   qe
5   qa
5   qs
5   qw
5   qe
5   qw
end

I've found that I can count occurrences of each value of the string variable dx using subscripts[1], or if I convert dx to a numeric label if using collapse[2].

Is there a command or syntax where I can count occurrences directly (without conversion etc.) from the string variable itself when using collapse?

e.g. if I tried collapse (count) countdx=dx, by(patid dx), this returns the error message variable dx not found.

(of course, this shouldn't work: and when I tried collapse (count) countdx=dx, by(patid), this returns the error type mismatch)

Notes:

[1]

by patid dx, sort: egen ndx = count(dx)
by patid dx: g orderdx=_n
by patid dx: drop if orderdx>1

[2]

g numdx=.
replace numdx=1 if dx=="qa"
replace numdx=2 if dx=="qe"
replace numdx=3 if dx=="qs"
replace numdx=4 if dx=="qw"
collapse (count)  countdx=numdx, by(patid dx)

Upvotes: 1

Views: 2736

Answers (1)

Nick Cox
Nick Cox

Reputation: 37208

Your examples, but not your question, all imply that you want to count separately for each distinct value of an identifier patid.

clear
input patid str2 dx 
1   qw
1   qe
1   qw
2   qw
2   qw
2   qs
2   qs
3   qe
3   qe
3   qs
3   qw
3   qw
3   qw
3   qs
4   qe
5   qa
5   qs
5   qw
5   qe
5   qw
end

bysort patid dx : gen count = _N 

tabdisp patid dx , c(count) 

----------------------------------
          |           dx          
    patid |   qa    qe    qs    qw
----------+-----------------------
        1 |          1           2
        2 |                2     2
        3 |          2     2     3
        4 |          1            
        5 |    1     1     1     2
----------------------------------

For a review of technique in this territory, see this paper. Searching Statalist for mentions of dm0042 will find very many related examples.

For problems of even moderate size, tabdisp won't be especially practical. It is mentioned here to show directly what the previous command does.

Extending this to collapse, one simple device is

gen one = 1

collapse (sum) one, by(patid dx)

although I should mention that contract was written more explicitly for this purpose (cf. discussion of its precursor in Cox 1998).

On the other hand, if you did create the count variable, then

collapse (mean) count, by(patid dx) 

would have precisely the same effect.

Cox, N.J. 1998. Collapsing datasets to frequencies. Stata Technical Bulletin 44: 2-3. .pdf here

Upvotes: 3

Related Questions