Reputation: 618
I have a large dataset that is of the following structure (here is an example):
+-----------------------------------------+
| id date treat match num |
|-----------------------------------------|
| 1 01feb2000 1 2 2 |
| 1 01apr2000 0 . . |
| 1 01jan2002 1 3 1 |
|-----------------------------------------|
| 2 01mar2000 1 3 0 |
| 2 01may2000 0 . . |
|-----------------------------------------|
| 3 01dec2002 1 . . |
+-----------------------------------------+
For each group identified by an id
I have certain occurrences at certain dates. Each occurrence is either treatment or control. Each treatment observation is matched to a certain id
. You can take id
, date
, treat
and match
variables as given.
My goal is to calculate the value of num
for each treatment observation that actually has a match (there can be cases in which treatment observations have no match, see id
3). The rule is to calculate the number of any occurrences in the matched id
(disregarding whether it is treated or not) within one year starting from the date of the treated observations.
Example: The very first treated observation is matched with id
2. id
2 has two observations within the next year after February 1st, 2000.
Important: First, the date of the first occurrence within a matched id will NEVER be before the date of the considered treatment observation. Second, several treatment observations can be matched with the same id
.
Please be aware that a similar question has been asked here here. Here, I am concerned with the efficiency of the algorithm as my dataset is huge. My solution is the following:
gen NUM = .
gen yrafter = 25000 // arbitrary date outside of the panel
format yrafter %td
gen in_window = 0
sort id date
forval i = 1/`=_N' {
if (match[`i'] != .) {
replace yrafter = (date_installation[`i'] + 365) if ags == match[`i']
replace in_window = date_installation <= yrafter & ags == match[`i']
by id: egen NUM_temp = sum(in_window) if ags == match[`i']
replace NUM_temp = 0 if NUM_temp == .
sum NUM_temp if ags == match[`i'], meanonly
replace NUM0 in `i' = r(max)
drop NUM_temp
replace in_window = 0 if ags == match[`i']
}
}
drop yrafter in_window
To reduce the number of iterations I would actually like to sort after treatment observations and only iterate over them. However, in my understanding of the problem I cannot do this because some subsequent commands require the sorting I have applied above (am I right on this?).
My strategy is to iterate over each relevant observation: For all observations in the matched id
group I replace the value of yrafter
with the latest possible occurrence that needs to be considered. Then in the variable in_window
I simply identify those observations that are before the latest possible date (remember, there cannot be any occurrences before the date of the treatment
variable) and afterwards count all the ones. I save the result in num
and set everything up for the next iteration.
Apparently, this requires a lot of time. My first guess would be to optimize the loop head because checking the if
condition seems very time-consuming for Stata. Does anybody have an idea for optimization?
Upvotes: 0
Views: 2692
Reputation: 11102
Try something along the lines of
clear
set more off
*----- example data -----
input ///
id str20 date treat match num
1 01feb2000 1 2 2
1 01apr2000 0 . .
1 01jan2002 1 3 1
2 01mar2000 1 3 0
2 01may2000 0 . .
3 01dec2002 1 . .
end
list, sepby(id)
gen dat = date(date,"DMY")
format %td dat
drop date
order dat, after(id)
list, sepby(id)
*----- what you want -----
gen num2 = .
forvalues i = 1/`=_N' {
if treat[`i'] == 1 {
count if id == match[`i'] & dat <= dat[`i'] + 365
replace num2 = r(N) in `i'
}
}
replace num2 = . if missing(match)
list, sepby(id)
assert num == num2
I have no information on the size of your data set, so unless you reveal that with precision, I won't do any testing.
(I did not check your code. I simply tried translating the wording of your problem to Stata code.)
Upvotes: 0
Reputation: 37278
I would worry about efficiency only when I had correct code. The following are bugs.
The test if (match != .)
reduces always to if (match[1] != .)
. Presumably you need to test each value in turn. http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/index.html may apply here.
The statement
replace NUM0 in `i' = r(max)
is illegal, and should be presumably be
replace NUM0 = r(max) in `i'
I can't comment on your major question, as I don't understand what you are trying to do. I guess your explanation is clear and logical; it's just too complicated for me to absorb the whole of what you are doing when I never do anything similar.
Upvotes: 0