bonifaz

Reputation: 598

How to get average number of observations per group?

In my dataset, I have observations for football matches. One of my variables is hometeam. Now I want to get the average number of observations per hometeam. How do I do that in Stata?

I know that I could tab hometeam, but since there are over 500 distinct hometeams, I don't want to do the calculation manually.

Upvotes: 2

Views: 6036

Answers (3)

Nick Cox

Reputation: 37208

bysort hometeam : gen n = _N 
bysort hometeam : gen tag = _n == 1 
su n if tag 

EDIT: Another way to do it, more concisely:

bysort hometeam : gen n = _N if _n == 1 
su n 

Why the tagging then? It is often useful to have a tag variable when you are moving back and forth between individual and group level. egen, tag() does the same thing.

Why if _n == 1? You need this value just once for each group, and there are two ways of getting it that always work, even for groups as small as one observation: use the first or the last observation in the group. In a group of 1 they are the same observation, but that doesn't matter. So if _n == _N is another way to do it.

bysort hometeam : gen n = _N if _n == _N 

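The code needs to be changed if observations with missing values on some variable should not be counted. Here sum() is a running sum, so the last observation in each group holds the number of non-missing values:

bysort hometeam : gen n = sum(!missing(myvar)) 
by hometeam : replace n = . if _n < _N 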

egen, count() is similar, but not identical: it stores the count in every observation of the group, not just the last one.
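
For comparison, here is a minimal sketch of the egen route mentioned above. It assumes the grouping variable is hometeam and that myvar is the variable whose missings should be ignored; n_all, n_valid and tag are illustrative names only.

```stata
* Sketch only: n_all, n_valid and tag are illustrative variable names
* count(1) counts every observation in the group
egen n_all = count(1), by(hometeam)
* count(myvar) counts only observations where myvar is not missing
egen n_valid = count(myvar), by(hometeam)
* tag() is 1 for exactly one observation per hometeam and 0 otherwise
egen tag = tag(hometeam)
summarize n_all n_valid if tag
```

Because egen fills in the count for every observation of a group, the if tag qualifier is what keeps each hometeam from entering the average more than once.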

Upvotes: 4

Metrics

Reputation: 15458

Option 2: Using the data of @Roberto

collapse (count) hometeam, by(id)
sum hometeam, meanonly
disp r(mean)

Upvotes: 2

Roberto Ferrer

Reputation: 11102

I assume you can identify the different hometeams with some id variable.

If you want the average number of observations per id this is one way:

clear all
set more off

input id hometeam
1 .
1 5
1 0
3 6
3 2
3 1
3 9
2 7
2 7
end

list, sepby(id)

bysort id: egen c = count(hometeam)
by id: keep if _n == 1
summarize c, meanonly
disp r(mean)

Note that egen's count() does not count observations in which hometeam is missing. If you did want to count the missings, then you could do:

bysort id: gen c = _n
by id: keep if _n == _N
summarize c, meanonly
disp r(mean)
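
Both variants above drop observations with keep. If the full dataset is still needed afterwards, one possible alternative (a sketch, not part of the original answer) is to use contract inside preserve/restore. It counts all observations per id, including those where hometeam is missing.

```stata
* Sketch only: preserve/restore brings the original data back afterwards
preserve
* contract leaves one observation per id, with the group size in _freq
contract id
summarize _freq, meanonly
display r(mean)
restore
```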

Upvotes: 2
