Reputation: 598
In my dataset, I have observations of football matches. One of my variables is `hometeam`. I want to get the average number of observations per `hometeam`. How do I do that in Stata?
I know that I could `tab hometeam`, but since there are over 500 distinct home teams, I don't want to do the calculation manually.
Upvotes: 2
Views: 6036
Reputation: 37208
bysort hometeam : gen n = _N
bysort hometeam : gen tag = _n == 1
su n if tag
EDIT: Another way to do it, more concisely:
bysort hometeam : gen n = _N if _n == 1
su n
Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. `egen, tag()` does the same thing.
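As a minimal sketch of the `egen, tag()` route mentioned above, the following uses a small made-up dataset (the numeric team codes are an assumption for illustration only, not the asker's data):

```stata
* Toy data: the team codes below are invented for illustration
clear
input hometeam
1
1
1
2
2
3
end

* Group size, repeated on every observation of the group
bysort hometeam : gen n = _N

* tag() marks exactly one observation per distinct hometeam with 1, all others with 0
egen tag = tag(hometeam)

* Average group size, using one observation per team
summarize n if tag, meanonly
display r(mean)
```

Here the three teams have 3, 2 and 1 observations, so the displayed mean is 2.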
Why `if _n == 1`? You need this value just once for each group, and there are two ways of getting it that always work, even for groups as small as one observation: use the first or the last observation in the group. In a group of one they are the same observation, but that doesn't matter. So `if _n == _N` is another way to do it.
bysort hometeam : gen n = _N if _n == _N
The code needs to change when you do not want to count missing values of some variable:
bysort hometeam : gen n = sum(!missing(myvar))
by hometeam : replace n = . if _n < _N
`egen, count()` is similar, but not identical.
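For comparison, here is a hedged sketch of the `egen, count()` variant. The answer does not spell out the difference; one that is visible in practice is that `count()` stores the nonmissing count on every observation of the group rather than only the last one, so a tag is still needed before summarizing. The data and the variable `myvar` are placeholders:

```stata
* Toy data: invented values, with one missing myvar in team 1
clear
input hometeam myvar
1 .
1 5
1 0
2 7
2 7
2 3
end

* count() gives the number of nonmissing myvar values within each hometeam,
* repeated on every observation of that team
egen n = count(myvar), by(hometeam)

* Keep one value per team when averaging
egen tag = tag(hometeam)
summarize n if tag, meanonly
display r(mean)
```

Team 1 has 2 nonmissing values and team 2 has 3, so the displayed mean is 2.5.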
Upvotes: 4
Reputation: 15458
Option 2: Using the data of @Roberto
collapse (count) hometeam, by(id)
summarize hometeam, meanonly
display r(mean)
Upvotes: 2
Reputation: 11102
I assume you can identify the different `hometeam`s with some `id` variable.
If you want the average number of observations per `id`, this is one way:
clear all
set more off
input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end
list, sepby(id)
bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)
Note that observations with missing values of `hometeam` are not counted by `count()`. If you did want to count the missings, then, starting again from the full dataset, you could do:
bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)
Upvotes: 2