Reputation: 551
First, have a look at some variables of my dataset:
firm_id year dyrstr Lack total_workers
2432 2002 1980 29
2432 2003 1980 23
2432 2005 1980 1 283
2432 2006 1980 56
2432 2007 1980 21
2433 2004 2001 42
2433 2006 2001 1 29
2433 2008 2001 1 100
2434 2002 2002 21
2434 2003 2002 55
2434 2004 2002 22
2434 2005 2002 24
2434 2006 2002 17
2434 2007 2002 40
2434 2008 2002 110
2434 2009 2002 158
2434 2010 2002 38
2435 2002 2002 80
2435 2003 2002 86
2435 2004 2002 877
2435 2005 2002 254
2435 2006 2002 71
2435 2007 2002 116
2435 2008 2002 118
2435 2009 2002 1165
2435 2010 2002 67
2436 2002 1992 24
2436 2003 1992 25
2436 2004 1992 22
2436 2005 1992 23
2436 2006 1992 21
2436 2007 1992 100
2436 2008 1992 73
2436 2009 1992 23
2436 2010 1992 40
2437 2002 2002 30
2437 2003 2002 31
2437 2004 2002 21
2437 2006 2002 1 56
2437 2007 2002 20
The variables:
I'd like to fill in the gaps, namely I'd like to create new observations as I show you in the following (only considering the firm with ID 2432):
firm_id year dyrstr Lack total_workers
2432 2002 1980 29
*2432* *2004* *1980* *156*
2432 2003 1980 23
2432 2005 1980 1 283
2432 2006 1980 56
2432 2007 1980 21
The line where I've put the values of the variables in asterisks is the newly created observation. This observation should be a copy of the previous observation but with some modification.
I read something about the the command expand
but help expand
doesn't help me much. Hopefully one of you can help me!
Upvotes: 1
Views: 19822
Reputation: 793
You've already awarded the answer, but I have had to do similar before and always use the cross
command as follows. Say I am using your dataset already & continue with the following code:
tempfile master year
save `master'
preserve
keep year
duplicates drop
save `year'
restore
//next two lines set me up to correct for different year ranges by firm; if year ranges were standard, this would be omitted
bys firm_id: egen minyear=min(year)
bys firm_id: egen maxyear=max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge m:1 firm_id year using `master', assert(1 3) nogen
drop if year<minyear | year>maxyear //this adjusts for years outside the earliest and latest years observed by firm; if year ranges standard, again omitted
Then from here, use the ipolate
command in the spirit of @NickCox.
I'm particularly interested in any pros/cons regarding the use of expand
and cross
. (Beyond the fact that my use here specifically hinges on >0 records for each year being observed in order to construct the crossed dataset, which could be eliminated if I create the `year'
tempfile differently.)
Upvotes: 1
Reputation: 11102
The following works if, like in your example database, you don't have consecutive years missing for any given firm. I also assume variable Lack
to be numeric and the final result is an unbalanced panel (you were not specific about this point in your question).
* Expand database
expand 2 if Lack == 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year - 1 if x == 1
replace total_workers = (total_workers[_n-1] + total_workers[_n+1])/2 if x == 1
list, sepby(firm_id)
The expand
line could be re-written as expand Lack + 1, gen(x)
, but maybe it is clearer that way.
For the more general case in which you do have consecutive years missing, the following should get you started under the assumption that Lack
specifies the number of consecutive years missing. For example, if there is a jump from 2006 to 2009 for a given firm, then Lack = 2
for the 2009 observation.
* Expand database
expand Lack + 1, gen(x)
gsort firm_id year -x
* Substitution rules
replace year = year[_n-1] + 1 if x == 1
Now you just need to come up with an imputation rule for your total_workers
:
replace total_workers = ...
If Lack
is a string, convert to numeric using real
.
Upvotes: 1
Reputation: 37208
My suggestions hinge on using expand
, which in turn just requires information on the number of observations to be added. I ignore your variable Lack
, as Stata itself can work out where the gaps are. My procedure for imputing total_workers
is based on using the inbuilt command ipolate
and thus would work over gaps longer than 1 year, which don't appear in your example. The number of workers so estimated is not necessarily an integer.
For other interpolation procedures, check out cipolate
, csipolate
, pchipolate
, all accessible via ssc desc cipolate
(or equivalent).
This kind of operation depends on getting sort
order exactly right, which I don't think is trivial, even with experience, so in getting the code right for similar problems, be prepared for false starts; pepper your trial code with list
statements; and work on a good toy example dataset (as you kindly provided here).
. clear
. input firm_id year dyrstr total_workers
firm_id year dyrstr total_w~s
1. 2432 2002 1980 29
2. 2432 2003 1980 23
3. 2432 2005 1980 283
4. 2432 2006 1980 56
5. 2432 2007 1980 21
6. 2433 2004 2001 42
7. 2433 2006 2001 29
8. 2433 2008 2001 100
9. 2434 2002 2002 21
10. 2434 2003 2002 55
11. 2434 2004 2002 22
12. 2434 2005 2002 24
13. 2434 2006 2002 17
14. 2434 2007 2002 40
15. 2434 2008 2002 110
16. 2434 2009 2002 158
17. 2434 2010 2002 38
18. 2435 2002 2002 80
19. 2435 2003 2002 86
20. 2435 2004 2002 877
21. 2435 2005 2002 254
22. 2435 2006 2002 71
23. 2435 2007 2002 116
24. 2435 2008 2002 118
25. 2435 2009 2002 1165
26. 2435 2010 2002 67
27. 2436 2002 1992 24
28. 2436 2003 1992 25
29. 2436 2004 1992 22
30. 2436 2005 1992 23
31. 2436 2006 1992 21
32. 2436 2007 1992 100
33. 2436 2008 1992 73
34. 2436 2009 1992 23
35. 2436 2010 1992 40
36. 2437 2002 2002 30
37. 2437 2003 2002 31
38. 2437 2004 2002 21
39. 2437 2006 2002 56
40. 2437 2007 2002 20
41. end
. scalar N = _N
. bysort firm_id (year) : gen gap = year - year[_n-1]
(6 missing values generated)
. expand gap
(6 missing counts ignored; observations not deleted)
(4 observations created)
. gen orig = _n <= scalar(N)
. bysort firm_id (year) : replace total_workers = . if !orig
(4 real changes made, 4 to missing)
. bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1
(4 real changes made)
. bysort firm_id (year): ipolate total_workers year , gen(total_workers2)
. list, sepby(firm_id)
+------------------------------------------------------------+
| firm_id year dyrstr total_~s gap orig total_~2 |
|------------------------------------------------------------|
1. | 2432 2002 1980 29 . 1 29 |
2. | 2432 2003 1980 23 1 1 23 |
3. | 2432 2004 1980 . 2 0 153 |
4. | 2432 2005 1980 283 2 1 283 |
5. | 2432 2006 1980 56 1 1 56 |
6. | 2432 2007 1980 21 1 1 21 |
|------------------------------------------------------------|
7. | 2433 2004 2001 42 . 1 42 |
8. | 2433 2005 2001 . 2 0 35.5 |
9. | 2433 2006 2001 29 2 1 29 |
10. | 2433 2007 2001 . 2 0 64.5 |
11. | 2433 2008 2001 100 2 1 100 |
|------------------------------------------------------------|
12. | 2434 2002 2002 21 . 1 21 |
13. | 2434 2003 2002 55 1 1 55 |
14. | 2434 2004 2002 22 1 1 22 |
15. | 2434 2005 2002 24 1 1 24 |
16. | 2434 2006 2002 17 1 1 17 |
17. | 2434 2007 2002 40 1 1 40 |
18. | 2434 2008 2002 110 1 1 110 |
19. | 2434 2009 2002 158 1 1 158 |
20. | 2434 2010 2002 38 1 1 38 |
|------------------------------------------------------------|
21. | 2435 2002 2002 80 . 1 80 |
22. | 2435 2003 2002 86 1 1 86 |
23. | 2435 2004 2002 877 1 1 877 |
24. | 2435 2005 2002 254 1 1 254 |
25. | 2435 2006 2002 71 1 1 71 |
26. | 2435 2007 2002 116 1 1 116 |
27. | 2435 2008 2002 118 1 1 118 |
28. | 2435 2009 2002 1165 1 1 1165 |
29. | 2435 2010 2002 67 1 1 67 |
|------------------------------------------------------------|
30. | 2436 2002 1992 24 . 1 24 |
31. | 2436 2003 1992 25 1 1 25 |
32. | 2436 2004 1992 22 1 1 22 |
33. | 2436 2005 1992 23 1 1 23 |
34. | 2436 2006 1992 21 1 1 21 |
35. | 2436 2007 1992 100 1 1 100 |
36. | 2436 2008 1992 73 1 1 73 |
37. | 2436 2009 1992 23 1 1 23 |
38. | 2436 2010 1992 40 1 1 40 |
|------------------------------------------------------------|
39. | 2437 2002 2002 30 . 1 30 |
40. | 2437 2003 2002 31 1 1 31 |
41. | 2437 2004 2002 21 1 1 21 |
42. | 2437 2005 2002 . 2 0 38.5 |
43. | 2437 2006 2002 56 2 1 56 |
44. | 2437 2007 2002 20 1 1 20 |
+------------------------------------------------------------+
Upvotes: 6