Reputation: 138
In a dataset like the one below,
clear
input patid str2 dx
1 qw
1 qe
1 qw
2 qw
2 qw
2 qs
2 qs
3 qe
3 qe
3 qs
3 qw
3 qw
3 qw
3 qs
4 qe
5 qa
5 qs
5 qw
5 qe
5 qw
end
I've found that I can count occurrences of each value of the string variable dx
using subscripts[1], or if I convert dx
to a numeric label if using collapse
[2].
Is there a command or syntax where I can count occurrences directly (without conversion etc.) from the string variable itself when using collapse
?
e.g. if I tried collapse (count) countdx=dx, by(patid dx)
, this returns the error message variable dx not found
.
(of course, this shouldn't work: and when I tried collapse (count) countdx=dx, by(patid)
, this returns the error type mismatch
)
Notes:
[1]
by patid dx, sort: egen ndx = count(dx)
by patid dx: g orderdx=_n
by patid dx: drop if orderdx>1
[2]
g numdx=.
replace numdx=1 if dx=="qa"
replace numdx=2 if dx=="qe"
replace numdx=3 if dx=="qs"
replace numdx=4 if dx=="qw"
collapse (count) countdx=numdx, by(patid dx)
Upvotes: 1
Views: 2736
Reputation: 37208
Your examples, but not your question, all imply that you want to count separately for each distinct value of an identifier patid
.
clear
input patid str2 dx
1 qw
1 qe
1 qw
2 qw
2 qw
2 qs
2 qs
3 qe
3 qe
3 qs
3 qw
3 qw
3 qw
3 qs
4 qe
5 qa
5 qs
5 qw
5 qe
5 qw
end
bysort patid dx : gen count = _N
tabdisp patid dx , c(count)
----------------------------------
| dx
patid | qa qe qs qw
----------+-----------------------
1 | 1 2
2 | 2 2
3 | 2 2 3
4 | 1
5 | 1 1 1 2
----------------------------------
For a review of technique in this territory, see this paper. Searching Statalist for mentions of dm0042
will find very many related examples.
For problems of even moderate size, tabdisp
won't be especially practical. It is mentioned here to show directly what the previous command does.
Extending this to collapse
, one simple device is
gen one = 1
collapse (sum) one, by(patid dx)
although I should mention that contract
was written more explicitly for this purpose (cf. discussion of its precursor in Cox 1998).
On the other hand, if you did create the count
variable, then
collapse (mean) count, by(patid dx)
would have precisely the same effect.
Cox, N.J. 1998. Collapsing datasets to frequencies. Stata Technical Bulletin 44: 2-3. .pdf here
Upvotes: 3