Reputation: 97
I have a rather simple question regarding the output of the tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousand observations over a 9-year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (that is, the sum of expenses across all ids in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show the means of expenses in table format. Please note that ids are different from countries.
In this case, does tabstat calculate the mean over all 9 years and all industries for a particular country, or is it just the mean of one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little confusing.
Upvotes: 1
Views: 776
Reputation: 37208
I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
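A tiny sandbox makes this concrete. The sketch below uses made-up numbers (one hypothetical country X, two industries, one year); the variable names follow the question, and everything else is an assumption for illustration only.

```stata
* Hypothetical sandbox data: one country, two industries, one year
clear
input str1 country str1 industry year expenses
"X" "A" 2013  10
"X" "A" 2013  20
"X" "A" 2013  30
"X" "B" 2013 100
end

* The total is copied to every observation in each (year, industry) group:
* 60 appears three times for industry A, 100 appears once for industry B
bysort year industry: egen total_expenses = total(expenses)
list, sepby(industry)

* tabstat averages all 4 observations: (60 + 60 + 60 + 100) / 4 = 70,
* not the mean of the two distinct totals, (60 + 100) / 2 = 80
tabstat total_expenses, by(country)
```

Industry A contributes three copies of its total and industry B only one, so the reported mean is pulled towards A.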
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
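The same kind of sandbox shows what the "means of means" amount to. Again the data are invented for illustration: with a single hypothetical country, the group means are 20 (three observations) and 100 (one observation).

```stata
* Hypothetical sandbox data: one country, two industries, one year
clear
input str1 country str1 industry year expenses
"X" "A" 2013  10
"X" "A" 2013  20
"X" "A" 2013  30
"X" "B" 2013 100
end

* Group means are copied to every observation in each (year, industry) group
bysort year industry: egen mean_expenses = mean(expenses)

* Mean of mean_expenses: (20 + 20 + 20 + 100) / 4 = 40
* Mean of expenses:      (10 + 20 + 30 + 100) / 4 = 40
* The unweighted mean of the two group means would be (20 + 100) / 2 = 60
tabstat mean_expenses expenses, by(country)
```

In this sandbox the observation-weighted mean of means simply reproduces the plain mean of expenses, because each group lies wholly inside the one country. With groups that span several countries, as in the question, that equality need not hold.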
I suspect that you need to learn about two commands: (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may be useful to you. That should apply to any future reader of this using a later version.
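As a rough sketch of how those two commands might be used here, assuming that what is wanted is for each (industry, year) total to count once per country: the data below are invented, and pick is a hypothetical variable name.

```stata
* Hypothetical sandbox data: two countries, two industries, one year
clear
input str1 country str1 industry year expenses
"X" "A" 2013  10
"X" "A" 2013  20
"X" "A" 2013  30
"X" "B" 2013 100
"Y" "A" 2013  40
"Y" "B" 2013 200
"Y" "B" 2013 300
end

* Totals across countries: industry A = 100, industry B = 600
bysort year industry: egen total_expenses = total(expenses)

* Option 1: tag() marks exactly one observation per combination,
* so each total enters the mean once per country: (100 + 600) / 2 = 350
egen byte pick = tag(country industry year)
tabstat total_expenses if pick, by(country)

* Option 2: collapse reduces the data to one row per combination;
* preserve and restore bring the original observations back afterwards
preserve
collapse (first) total_expenses, by(country industry year)
tabstat total_expenses, by(country)
restore
```

Both routes give the same table here. Without the tagging, country X would show (100 + 100 + 100 + 600) / 4 = 225, because industry A has three observations there. Whether the unweighted version is the right answer depends on what you want to measure.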
Upvotes: 1