Hosea

Reputation: 205

Matching values in a variable by year

I have the following minimal example:

input str5 name year match1 match2 match3
Alice 2000 . . .  
Alice 2000 . . .  
Bob 2000 . . . 
Carol 2001 0 . . 
Alice 2002 0 1 .
Carol 2002 1 0 .
Bob 2003 0 0 1
Bob 2003 0 0 1
end 

I have data on name and year, and I want to create binary variables match1, match2 and match3, where matchN equals 1 if the name appears in the data N years earlier. For example, for the first observation (Alice, 2000), match1 equals 1 if Alice appears in 1999, match2 equals 1 if Alice appears in 1998, and so on.

If the earlier year does not occur in the data at all (here there is no 1999 or 1998), the binary variable should be missing.

How can I construct these match variables? Note that I have millions of unique names, so levelsof name, local(match) fails with the error "macro substitution results in line that is too long". Also note that a name is sometimes duplicated within a year, and some names are absent in some years.

Upvotes: 0

Views: 218

Answers (2)

langtang

Reputation: 24722

Here is an alternative approach using frames:

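* keep only the variables needed for matching
keep name year
* second frame with one row per name and year, used as the lookup table
frame copy default prev
frame prev: duplicates drop
frame prev: rename year myear

* myear holds the year to look up: year minus the lag
gen myear=.
forvalues i=1/3 {
    replace myear = year-`i'
    * link each observation to the same name at the lagged year, if present
    frlink m:1 name myear, frame(prev) generate(match`i')
    * frlink stores the matched row number; recode any match to 1
    replace match`i' = 1 if match`i'!=.
}
drop myear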

Output:

        name   year   match1   match2   match3  
  1.   Alice   2000        .        .        .  
  2.   Alice   2000        .        .        .  
  3.     Bob   2000        .        .        .  
  4.   Carol   2001        .        .        .  
  5.   Alice   2002        .        1        .  
  6.   Carol   2002        1        .        .  
  7.     Bob   2003        .        .        1  
  8.     Bob   2003        .        .        1  
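
The output above leaves every non-match missing, whereas the question asks for 0 when the earlier year occurs somewhere in the data. Here is a minimal sketch of one way to get that with the same frames technique (Stata 16 or later). It assumes that "the year exists" means at least one observation, of any name, has that year. The frame yrs and the variables hit and yr are hypothetical helpers, not part of the answer above.

```stata
clear
input str5 name int year
"Alice" 2000
"Alice" 2000
"Bob"   2000
"Carol" 2001
"Alice" 2002
"Carol" 2002
"Bob"   2003
"Bob"   2003
end

* lookup frame: one row per (name, year)
frame copy default prev, replace
frame prev: duplicates drop
frame prev: rename year myear

* lookup frame: one row per year that occurs anywhere in the data
frame copy default yrs, replace
frame yrs: keep year
frame yrs: duplicates drop
frame yrs: rename year myear

gen myear = .
forvalues i = 1/3 {
    replace myear = year - `i'
    * hit is nonmissing if this name appears in the lagged year
    frlink m:1 name myear, frame(prev) generate(hit)
    * yr is nonmissing if the lagged year appears for any name
    frlink m:1 myear, frame(yrs) generate(yr)
    * 1 = name found, 0 = year exists but name absent, . = year not in data
    gen byte match`i' = !missing(hit) if !missing(yr)
    drop hit yr
}
drop myear
list, sepby(year)
```

On the example data this should reproduce the match1 to match3 values shown in the question. Both lookups go through frlink, so nothing depends on the number of distinct names fitting in a macro.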

Upvotes: 1

Nick Cox

Reputation: 37183

Thanks for the data example. Here is some technique using rangestat from SSC. I don't understand your rule on which values should be 0 and which missing.

* Example generated by -dataex-. For more info, type help dataex
clear
input str5 name float year
"Alice" 2000
"Alice" 2000
"Alice" 2002
"Bob"   2000
"Bob"   2003
"Bob"   2003
"Carol" 2001
"Carol" 2002
end

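* constant to summarize: its maximum is 1 wherever any observation exists
gen one = 1

forval j = 1/3 {
    * within each name, look only at observations exactly j years earlier
    rangestat (max) match`j'=one, int(year -`j' -`j') by(name)
}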

drop one 

sort name year 
list, sepby(year)


     +-----------------------------------------+
     |  name   year   match1   match2   match3 |
     |-----------------------------------------|
  1. | Alice   2000        .        .        . |
  2. | Alice   2000        .        .        . |
     |-----------------------------------------|
  3. | Alice   2002        .        1        . |
     |-----------------------------------------|
  4. |   Bob   2000        .        .        . |
     |-----------------------------------------|
  5. |   Bob   2003        .        .        1 |
  6. |   Bob   2003        .        .        1 |
     |-----------------------------------------|
  7. | Carol   2001        .        .        . |
     |-----------------------------------------|
  8. | Carol   2002        1        .        . |
     +-----------------------------------------+

As the original author of levelsof I find it a little melancholy to see it pressed into service where it is of little or no help.

Upvotes: 3
