Reputation: 111
I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?
forvalues y = 1982/2012 {
forvalues m = 1/12 {
*display date("01-`m'-`y'","DMY")
count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt
}
}
Upvotes: 1
Views: 208
Reputation: 37278
Assuming a firm identifier firmid
, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
So,
We expand
each observation to 2 and put both dates in a new variable, the start date in one observation and the end date in the other observation.
We assign a score that is 1 when a firm starts and -1 when it ends.
The number of firms is increased by 1 every time a firm starts and decreased by 1 every time one ends. We just need to sort by date and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)
This new data structure could be useful for other purposes.
There is a write-up at http://www.stata-journal.com/article.html?article=dm0068
EDIT:
Notes in response to @Roberto Ferrer (and anyone else who read this):
I fixed a bad bug, which made this too difficult to understand. Sorry about that.
The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date as it would just be the same number as the previous date used. If you needed, however, to interpolate to a grid of dates, copying the previous count would be sufficient.
It is important not to confuse the Stata function sum()
which returns the cumulative sum with any egen
function. The impression that egen
's total()
is an alternative here was a side-effect of my bug.
Upvotes: 1
Reputation: 11102
One way is to use the inrange
function. In Stata, Date
variables are just integers so you can easily operate on them.
forvalues y = 1982/2012 {
forvalues m = 1/12 {
local d = date("01-`m'-`y'","DMY")
count if inrange(`d', start_dt, end_dt)
}
}
This alone will save you a huge amount of time. For 50.000 observations (and made-up data):
. timer list 1
1: 3.40 / 1 = 3.3980
. timer list 2
2: 18.61 / 1 = 18.6130
timer 1 is with inrange
, timer 2 is your original code. Results are in seconds. Run help inrange
and help timer
for details.
That said, maybe someone can suggest an overall better strategy.
Upvotes: 1