Reputation: 11
Suppose that in Stata I should have 10 people per household, so each has their own ID in sequence. What if the data only have information like the table below?
hhid | pid |
---|---|
1 | 1 |
1 | 2 |
1 | 5 |
1 | 6 |
1 | 7 |
1 | 8 |
As you can see, there are only 6 observations above. What should I do if I want to add new observations so that my data will look like this:
hhid | pid |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
1 | 4 |
1 | 5 |
1 | 6 |
1 | 7 |
1 | 8 |
1 | 9 |
1 | 10 |
What if there is more than one household with the same issues?
Edit: Answered
Thank you for your answer @Nick Cox!
Yes, I didn't put any context into the question. What I wanted to do was to use a variable "line biological mother" (`bm`) to create a new variable, "biological mother's height" (`mheight`), by using
bys hhid: gen mheight = height[bm]
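In some cases, like in the table below, `pid` 5 is in the third row because some observations are missing, while `pid` 1 is correctly placed in the first row. So `height[bm]` picks up the wrong row whenever the mother's `pid` differs from her row number within the household.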
hhid | pid | bm | height |
---|---|---|---|
1 | 1 | . | 173 |
1 | 2 | 1 | 180 |
1 | 5 | . | 165 |
1 | 6 | 5 | 82 |
1 | 7 | 5 | 90 |
1 | 8 | 5 | 120 |
I don't know if there's an easier way to fix this, but `fillin` will kinda sort things out.
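To make the mismatch concrete, here is a minimal sketch using the six rows from the table above. The variable `wrong_row` is a hypothetical helper added only for illustration; it flags observations where the row selected by `height[bm]` does not belong to the person whose `pid` equals `bm`.

```stata
* Sketch only: reproduces the data from the table above
clear
input byte(hhid pid bm) int height
1 1 . 173
1 2 1 180
1 5 . 165
1 6 5  82
1 7 5  90
1 8 5 120
end

* Under by:, subscripts count rows within the household, not pid values
bysort hhid (pid): gen mheight = height[bm]

* Hypothetical check: is the row picked by [bm] really the mother's row?
bysort hhid (pid): gen byte wrong_row = bm < . & pid[bm] != bm

list, sepby(hhid)
```

For persons 6, 7 and 8 the subscript `[5]` lands on the fifth row of the household, which holds `pid` 7, not `pid` 5.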
Upvotes: 0
Views: 236
Reputation: 37208
So long as there is at least one household with all 10 personal identifiers, `fillin` gets you there directly. Note this way of showing data examples, as explained in the Stata tag wiki.
* Example generated by -dataex-. For more info, type help dataex
clear
input int hhid byte pid
1 1
1 2
1 5
1 6
1 7
1 8
999 1
999 2
999 3
999 4
999 5
999 6
999 7
999 8
999 9
999 10
end
. fillin hhid pid
. l
+----------------------+
| hhid pid _fillin |
|----------------------|
1. | 1 1 0 |
2. | 1 2 0 |
3. | 1 3 1 |
4. | 1 4 1 |
5. | 1 5 0 |
|----------------------|
6. | 1 6 0 |
7. | 1 7 0 |
8. | 1 8 0 |
9. | 1 9 1 |
10. | 1 10 1 |
|----------------------|
11. | 999 1 0 |
12. | 999 2 0 |
13. | 999 3 0 |
14. | 999 4 0 |
15. | 999 5 0 |
|----------------------|
16. | 999 6 0 |
17. | 999 7 0 |
18. | 999 8 0 |
19. | 999 9 0 |
20. | 999 10 0 |
+----------------------+
However, without more explanation this sounds like a recipe for bloating your dataset with useless extra observations.
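If no household happens to contain all 10 identifiers, `fillin` has no way to know about the values of `pid` that never occur. One possible workaround, sketched here under the assumption that every household should hold `pid` 1 to 10 and that `hhid` and `pid` jointly identify observations, is to build the full grid separately and merge it in. The names `grid` and `added` are illustrative only.

```stata
* Sketch only: assumes each household should have pid 1, ..., 10
* and that hhid and pid jointly identify observations
preserve
keep hhid
duplicates drop
expand 10                          // 10 copies of each household
bysort hhid: gen byte pid = _n     // number them 1 to 10
tempfile grid
save `grid'
restore

* Observations found only in the grid are the newly added ones
merge 1:1 hhid pid using `grid'
gen byte added = _merge == 2
drop _merge
sort hhid pid
```

Here `added` plays the same role as `_fillin` in the listing above. The caution about bloating the dataset applies equally.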
EDIT
The underlying problem has been revealed as looking up the height of the birth mother within each household. Consider this data example:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(hhid pid bm) int height
1 1 . 173
1 2 1 180
1 5 . 165
1 6 5 82
1 7 5 90
1 8 5 120
end
The existing birth mother identifier is almost fit for purpose, but fill in missings with an identifier that does not occur in the data, say 0:
. gen bm2 = cond(bm < ., bm, 0)
This is to ensure that the calculation produces missing, rather than a mean over all observations, whenever that is the right answer. (The syntax of `rangestat` is often to use `.` to mean "anything appropriate", a syntax also used in some contexts in official Stata.)
Install `rangestat` and then apply.
. ssc install rangestat
. rangestat bm_height=height, int(pid bm2 bm2) by(hhid)
. list, sep(0)
+-------------------------------------------+
| hhid pid bm height bm2 bm_hei~t |
|-------------------------------------------|
1. | 1 1 . 173 0 . |
2. | 1 2 1 180 1 173 |
3. | 1 5 . 165 0 . |
4. | 1 6 5 82 5 165 |
5. | 1 7 5 90 5 165 |
6. | 1 8 5 120 5 165 |
+-------------------------------------------+
So for two people with no documented birth mother it is fair that missing is returned as the birth mother's height. Otherwise person 2 has birth mother 1 and persons 6, 7, 8 have birth mother 5, so the code looks in the appropriate observation in each case and copies the value there to the new variable. The `by()` option ensures that this is done within households.
The help for `rangestat` has a similar example.
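For comparison, the same lookup can also be sketched with official commands only, by merging the data with a copy of itself. This is a different technique from the `rangestat` call above, not a restatement of it. It assumes that `hhid` and `pid` jointly identify observations and that `pid` is never missing; `mothers` and `bm_height2` are names chosen for illustration.

```stata
* Sketch only: a merge-based lookup instead of rangestat
* Assumes hhid and pid jointly identify observations and pid is never missing
preserve
keep hhid pid height
rename (pid height) (bm bm_height2)   // each person as a candidate mother
tempfile mothers
save `mothers'
restore

* Match each person's bm to the pid of someone in the same household
merge m:1 hhid bm using `mothers', keep(master match) nogenerate
sort hhid pid
list hhid pid bm height bm_height2, sepby(hhid)
```

People with missing `bm` find no match and so keep a missing `bm_height2`, in line with the result shown above.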
Upvotes: 0