Insan Alfarizy

Reputation: 11

Add missing sequence of ID into observation

Suppose that in Stata each household should have 10 people, each with their own ID in sequence (1 to 10). What if the data contain only some of them, as in the table below?

hhid pid
1 1
1 2
1 5
1 6
1 7
1 8

As you can see, there are only 6 observations above. What should I do if I want to add new observations so that my data will look like this:

hhid pid
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10

What if more than one household has the same issue?

Edit: Answered

Thank you for your answer @Nick Cox! 

Yes, I didn't give any context in the question. What I wanted to do was to use the variable "line number of biological mother" (bm) to create a new variable, "biological mother's height" (mheight), by using

bys hhid: gen mheight = height[bm]

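In some cases, as in the table below, the person with pid 5 is in the third row because some observations are missing, while the person with pid 1 is correctly in the first row. Since a subscript such as height[bm] refers to the observation number within the household, not to pid, bm equal to 5 picks up the fifth row rather than the person whose pid is 5.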

hhid pid bm height
1 1 . 173
1 2 1 180
1 5 . 165
1 6 5 82
1 7 5 90
1 8 5 120

I don't know if there's an easier way to fix this, but fillin more or less sorts it out.

Upvotes: 0

Views: 236

Answers (1)

Nick Cox

Reputation: 37208

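So long as every personal identifier from 1 to 10 occurs somewhere in the dataset (most simply, because at least one household has all 10), fillin gets you there directly: it adds an observation for every combination of hhid and pid that is not already present. Note this way of showing data examples, as explained in the Stata tag wiki.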

* Example generated by -dataex-. For more info, type help dataex
clear
input int hhid byte pid
  1  1
  1  2
  1  5
  1  6
  1  7
  1  8
999  1
999  2
999  3
999  4
999  5
999  6
999  7
999  8
999  9
999 10
end

. fillin hhid pid

. l

     +----------------------+
     | hhid   pid   _fillin |
     |----------------------|
  1. |    1     1         0 |
  2. |    1     2         0 |
  3. |    1     3         1 |
  4. |    1     4         1 |
  5. |    1     5         0 |
     |----------------------|
  6. |    1     6         0 |
  7. |    1     7         0 |
  8. |    1     8         0 |
  9. |    1     9         1 |
 10. |    1    10         1 |
     |----------------------|
 11. |  999     1         0 |
 12. |  999     2         0 |
 13. |  999     3         0 |
 14. |  999     4         0 |
 15. |  999     5         0 |
     |----------------------|
 16. |  999     6         0 |
 17. |  999     7         0 |
 18. |  999     8         0 |
 19. |  999     9         0 |
 20. |  999    10         0 |
     +----------------------+

However, without more explanation this sounds like a recipe for bloating your dataset with useless extra observations.
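
If no household happens to contain all 10 identifiers, fillin cannot invent the absent ones. A minimal sketch of one way around that, assuming hhid and pid jointly identify observations and that 10 is the intended maximum (the tempfile name grid is just illustrative): build the complete grid of households and identifiers separately and merge it in. Run it from a do-file so that the local macro for the tempfile is defined.

```stata
* Sketch: build a complete hhid x pid grid (pid = 1..10) and merge it in.
* Assumes hhid and pid uniquely identify observations in the data in memory.
preserve
keep hhid
duplicates drop                      // one observation per household
expand 10                            // ten copies of each household
bysort hhid: generate byte pid = _n  // number the copies 1..10
tempfile grid
save `grid'
restore

* _merge == 2 flags the observations added from the grid
merge 1:1 hhid pid using `grid'
sort hhid pid
```

Here _merge plays the role that _fillin plays after fillin: observations with _merge == 2 are the ones that were added.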

EDIT

The underlying problem has been revealed as looking up the height of the birth mother within each household. Consider this data example:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(hhid pid bm) int height
1 1 . 173
1 2 1 180
1 5 . 165
1 6 5  82
1 7 5  90
1 8 5 120
end

The existing birth mother identifier is almost fit for purpose, but first replace its missing values with an identifier that does not occur in the data, say 0:

. gen bm2 = cond(bm < ., bm, 0)

This is to ensure that the calculation produces missing, rather than a mean over all observations, whenever missing is the right answer. (In rangestat syntax, . often means "anything appropriate", a convention also used in some contexts in official Stata.)

Install rangestat and then apply.

. ssc install rangestat 

. rangestat bm_height=height, int(pid bm2 bm2) by(hhid)


. list, sep(0)

     +-------------------------------------------+
     | hhid   pid   bm   height   bm2   bm_hei~t |
     |-------------------------------------------|
  1. |    1     1    .      173     0          . |
  2. |    1     2    1      180     1        173 |
  3. |    1     5    .      165     0          . |
  4. |    1     6    5       82     5        165 |
  5. |    1     7    5       90     5        165 |
  6. |    1     8    5      120     5        165 |
     +-------------------------------------------+

So for the two people with no documented birth mother, missing is rightly returned as the birth mother's height. Otherwise, person 2 has birth mother 1 and persons 6, 7 and 8 have birth mother 5, so the code looks in the appropriate observation in each case and copies the value there to the new variable. The by() option ensures that this is done within households.

The help for rangestat has a similar example.
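
For anyone who would rather not install a community-contributed command, the same lookup can be sketched with official Stata only, using merge. This is an alternative, not what rangestat does internally. It assumes the data example above is in memory, that hhid and pid jointly identify observations, and it uses the question's name mheight for the new variable (the tempfile name mothers is illustrative). Run it from a do-file.

```stata
* Sketch: look up the birth mother's height with merge (official Stata only).
* Assumes hhid and pid uniquely identify observations.
preserve
keep hhid pid height
rename (pid height) (bm mheight)   // each person becomes a potential mother
tempfile mothers
save `mothers'
restore

* match each person's bm to the pid of the mother within the same household;
* people with missing bm find no match and so get missing mheight
merge m:1 hhid bm using `mothers', keep(master match) nogenerate
sort hhid pid
```

The keep(master match) option discards people in the lookup file who are nobody's birth mother, so the number of observations is unchanged.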

Upvotes: 0
