Jhonny
Jhonny

Reputation: 618

Counting observations under varying conditions in Stata

I have a large dataset that is of the following structure (here is an example):

+-----------------------------------------+
| id       date   treat   match   num     |
|-----------------------------------------|
|  1  01feb2000       1       2      2    |
|  1  01apr2000       0       .      .    |
|  1  01jan2002       1       3      1    |
|-----------------------------------------|
|  2  01mar2000       1       3      0    |
|  2  01may2000       0       .      .    |
|-----------------------------------------|
|  3  01dec2002       1       .      .    |
+-----------------------------------------+

For each group identified by an id I have certain occurrences at certain dates. Each occurrence is either treatment or control. Each treatment observation is matched to a certain id. You can take id, date, treat and match variables as given.

My goal is to calculate the value of num for each treatment observation that actually has a match (there can be cases in which treatment observations have no match, see id 3). The rule is to calculate the number of any occurrences in the matched id (disregarding whether it is treated or not) within one year starting from the date of the treated observations.

Example: The very first treated observation is matched with id 2. id 2 has two observations within the next year after February 1st, 2000.

Important: First, the date of the first occurrence within a matched id will NEVER be before the date of the considered treatment observation. Second, several treatment observations can be matched with the same id.

Please be aware that a similar question has been asked here here. Here, I am concerned with the efficiency of the algorithm as my dataset is huge. My solution is the following:

gen NUM = .
gen yrafter = 25000    // arbitrary date outside of the panel
format yrafter %td
gen in_window = 0
sort id date

forval i = 1/`=_N' {
    if (match[`i'] != .) {
        replace yrafter = (date_installation[`i'] + 365) if ags == match[`i']
            replace in_window = date_installation <= yrafter & ags == match[`i']

            by id: egen NUM_temp = sum(in_window) if ags == match[`i']
            replace NUM_temp = 0 if NUM_temp == .
            sum NUM_temp if ags == match[`i'], meanonly

            replace NUM0 in `i' = r(max)
        drop NUM_temp

        replace in_window = 0 if ags == match[`i']
    }
}

drop yrafter in_window

To reduce the number of iterations I would actually like to sort after treatment observations and only iterate over them. However, in my understanding of the problem I cannot do this because some subsequent commands require the sorting I have applied above (am I right on this?).

My strategy is to iterate over each relevant observation: For all observations in the matched id group I replace the value of yrafter with the latest possible occurrence that needs to be considered. Then in the variable in_window I simply identify those observations that are before the latest possible date (remember, there cannot be any occurrences before the date of the treatment variable) and afterwards count all the ones. I save the result in num and set everything up for the next iteration.

Apparently, this requires a lot of time. My first guess would be to optimize the loop head because checking the if condition seems very time-consuming for Stata. Does anybody have an idea for optimization?

Upvotes: 0

Views: 2692

Answers (2)

Roberto Ferrer
Roberto Ferrer

Reputation: 11102

Try something along the lines of

clear
set more off

*----- example data -----

input ///
 id  str20 date   treat   match   num     
  1  01feb2000       1       2      2    
  1  01apr2000       0       .      .    
  1  01jan2002       1       3      1    
  2  01mar2000       1       3      0    
  2  01may2000       0       .      .    
  3  01dec2002       1       .      .    
end

list, sepby(id)

gen dat = date(date,"DMY")
format %td dat
drop date
order dat, after(id)

list, sepby(id)

*----- what you want -----

gen num2 = .
forvalues i = 1/`=_N' {
    if treat[`i'] == 1 {
        count if id == match[`i'] & dat <=  dat[`i'] + 365
        replace num2 = r(N) in `i'
    }
}
replace num2 = . if missing(match)

list, sepby(id)

assert num == num2

I have no information on the size of your data set, so unless you reveal that with precision, I won't do any testing.

(I did not check your code. I simply tried translating the wording of your problem to Stata code.)

Upvotes: 0

Nick Cox
Nick Cox

Reputation: 37278

I would worry about efficiency only when I had correct code. The following are bugs.

is illegal, and should be presumably be

    replace NUM0 = r(max) in `i' 

I can't comment on your major question, as I don't understand what you are trying to do. I guess your explanation is clear and logical; it's just too complicated for me to absorb the whole of what you are doing when I never do anything similar.

Upvotes: 0

Related Questions