Buras
Buras

Reputation: 3099

How to write the best code for data aggregation?

I have the following dataset (individual level data):

pid  year state income
1    2000  il    100
2    2000  ms    200
3    2000  al     30
4    2000  dc    400
5    2000  ri    205
1    2001  il    120
2    2001  ms    230
3    2001  al     50
4    2001  dc    400
5    2001  ri    235
.........etc.......

I need to estimate average income for each state in each year and create a new dataset that would look like this:

state   year   average_income
ar      2000      150
ar      2001      200
ar      2002      250
il      2000      150
il      2001      160
il      2002      160
...........etc...............

I already have a code that runs perfectly fine (I have two loops). However, I would like to know is there any better way in Stata like sql style query?

Upvotes: 1

Views: 608

Answers (3)

Nick Cox
Nick Cox

Reputation: 37208

This is shorter code than any suggested so far:

  collapse average_income=income, by(state year) 

Upvotes: 11

Gordon Linoff
Gordon Linoff

Reputation: 1269703

You have the SQL tag on the question. This is a basic aggregation query in SQL:

select state, year, avg(income) as average_income
from t
group by state, year;

To put this in a table, depends on your database. One of the following typically works:

create table NewTable as
    select state, year, avg(income) as average_income
    from t
    group by state, year;

Or:

select state, year, avg(income) as average_income
into NewTable
from t
group by state, year;

Upvotes: 1

Kyle Heuton
Kyle Heuton

Reputation: 9768

This shouldn't need 2 loops, or any for that matter. There are in fact more efficient ways to do this. When you are repeating an operation on many groups, the bysort command is useful:

bysort year state: egen average_income = mean(income)

You also don't have to create a new dataset, you can just prune this one and save it. Start by only keeping the variables you want (state, year and average_income) and get rid of duplicates:

keep state year average_income
duplicates drop
save "mynewdataset.dta"

Upvotes: 2

Related Questions