Reputation: 7167

Fill in missing values of one variable using match with another variable

Imagine the following Stata data structure:

input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end

I want to fill the missing values using the corresponding match of pairs for other observations. However, if there is ambiguity (in the example, 3 corresponding to both 1 and 2), the code should not copy. In my example, the final data structure should look like this:

Note that both 1 and 2 are filled, as they are unambiguously 3.

My data is only numeric, and the number of unique values of variables x and y is large, so I am looking for a general rule that works in every case.

I am thinking on using the user-written command carryforward, running something like

bysort x: carryforward y if x != . , replace dynamic_condition(x[_n-1] == x[_n]) strict
bysort y: carryforward x if y != . , replace dynamic_condition(y[_n-1] == y[_n]) strict

Yet, this does not work when there are double matches.

UPDATE: the solution proposed by Nick does not work for every example. I updated the example to reflect this. The reason why the proposed solution does not work is because the function tag puts a 1 only at one instance of each value. Thus, when a value (3) is related to two values (1, 2), the tag will appear only in one of them. Hence, the copying occurs for one. In the example above, Nick's code and results are:

egen tagy = tag(y) if !missing(y) 
egen tagx = tag(x) if !missing(x) 
egen ny = total(tagy), by(x)
egen nx = total(tagx), by(y) 

bysort x (y) : replace y = y[1] if ny == 1 
bysort y (x) : replace x = x[1] if nx == 1 

list, sep(0)  

     +-------------------------------+
     | x   y   tagy   tagx   ny   nx |
     |-------------------------------|
  1. | 1   3      0      0    1    0 |
  2. | 1   3      0      0    1    0 |
  3. | 1   3      1      1    1    2 |
  4. | 2   3      0      1    0    2 |
  5. | .   3      0      0    0    2 |
  6. | 2   .      0      0    0    0 |
  7. | 2   .      0      0    0    0 |
     +-------------------------------+

As seen, the code works for filling x=1 and not filling y=3 (line 5). Yet, it does not fill lines 6 and 7 because tagy=1 only appears once (x=1).

Upvotes: 1

Answers (2)

Vevina Liu

Reputation: 51

sort x, y
replace y = y[_n-1] if missing(y) & x[_n-1] == x[_n]

Upvotes: 0

Danielle

Reputation: 339

This is a bit clunky, but it should work:

bysort x: egen temp=sd(x) if x!=.  
bysort x (y): replace y=y[1] if temp==0
drop temp

Since the standard deviation of a constant is zero, temp=0 if non-missing x's are all the same.

Upvotes: 0

Fill in missing values of one variable using match with another variable

Answers (2)

Related Questions