Reputation: 7167
Imagine the following Stata data structure:
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end
I want to fill the missing values using the corresponding match of pairs for other observations. However, if there is ambiguity (in the example, 3 corresponding to both 1 and 2), the code should not copy. In my example, the final data structure should look like this:
1 3
1 3
1 3
2 3
2 3
2 3
. 3
Note that both 1 and 2 are filled, as they are unambiguously 3.
My data is only numeric, and the number of unique values of variables x
and y
is large, so I am looking for a general rule that works in every case.
I am thinking on using the user-written command carryforward
, running something like
bysort x: carryforward y if x != . , replace dynamic_condition(x[_n-1] == x[_n]) strict
bysort y: carryforward x if y != . , replace dynamic_condition(y[_n-1] == y[_n]) strict
Yet, this does not work when there are double matches.
UPDATE: the solution proposed by Nick does not work for every example. I updated the example to reflect this. The reason why the proposed solution does not work is because the function tag
puts a 1 only at one instance of each value. Thus, when a value (3) is related to two values (1, 2), the tag will appear only in one of them. Hence, the copying occurs for one. In the example above, Nick's code and results are:
egen tagy = tag(y) if !missing(y)
egen tagx = tag(x) if !missing(x)
egen ny = total(tagy), by(x)
egen nx = total(tagx), by(y)
bysort x (y) : replace y = y[1] if ny == 1
bysort y (x) : replace x = x[1] if nx == 1
list, sep(0)
+-------------------------------+
| x y tagy tagx ny nx |
|-------------------------------|
1. | 1 3 0 0 1 0 |
2. | 1 3 0 0 1 0 |
3. | 1 3 1 1 1 2 |
4. | 2 3 0 1 0 2 |
5. | . 3 0 0 0 2 |
6. | 2 . 0 0 0 0 |
7. | 2 . 0 0 0 0 |
+-------------------------------+
As seen, the code works for filling x=1
and not filling y=3
(line 5). Yet, it does not fill lines 6 and 7 because tagy=1
only appears once (x=1
).
Upvotes: 1
Views: 2289
Reputation: 339
This is a bit clunky, but it should work:
bysort x: egen temp=sd(x) if x!=.
bysort x (y): replace y=y[1] if temp==0
drop temp
Since the standard deviation of a constant is zero, temp=0 if non-missing x's are all the same.
Upvotes: 0