Mame
Mame

Reputation: 111

Count number of living firms efficiently

I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?

forvalues y = 1982/2012 {
    forvalues m = 1/12 {

    *display date("01-`m'-`y'","DMY")
    count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt

    }
}

Upvotes: 1

Views: 208

Answers (2)

Nick Cox
Nick Cox

Reputation: 37278

Assuming a firm identifier firmid, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.

expand 2 
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt) 
by firmid : gen score = cond(_n == 1, 1, -1) 
sort eitherdate 
gen living = sum(score) 
by eitherdate : replace living = living[_N] 

So,

  1. We expand each observation to 2 and put both dates in a new variable, the start date in one observation and the end date in the other observation.

  2. We assign a score that is 1 when a firm starts and -1 when it ends.

  3. The number of firms is increased by 1 every time a firm starts and decreased by 1 every time one ends. We just need to sort by date and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)

This new data structure could be useful for other purposes.

There is a write-up at http://www.stata-journal.com/article.html?article=dm0068

EDIT:

Notes in response to @Roberto Ferrer (and anyone else who read this):

  1. I fixed a bad bug, which made this too difficult to understand. Sorry about that.

  2. The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date as it would just be the same number as the previous date used. If you needed, however, to interpolate to a grid of dates, copying the previous count would be sufficient.

  3. It is important not to confuse the Stata function sum() which returns the cumulative sum with any egen function. The impression that egen's total() is an alternative here was a side-effect of my bug.

Upvotes: 1

Roberto Ferrer
Roberto Ferrer

Reputation: 11102

One way is to use the inrange function. In Stata, Date variables are just integers so you can easily operate on them.

forvalues y = 1982/2012 {

    forvalues m = 1/12 {

        local d = date("01-`m'-`y'","DMY")
        count if inrange(`d', start_dt, end_dt)

    }
}

This alone will save you a huge amount of time. For 50.000 observations (and made-up data):

. timer list 1
   1:      3.40 /        1 =       3.3980

. timer list 2
   2:     18.61 /        1 =      18.6130

timer 1 is with inrange, timer 2 is your original code. Results are in seconds. Run help inrange and help timer for details.

That said, maybe someone can suggest an overall better strategy.

Upvotes: 1

Related Questions