user3274019


How do I take data from one observation and apply it to one other observation within a group?

An unmarried couple is living together in a house with other people. To isolate how much that couple makes, I need to add the two incomes together. I am using variables that act as pointers: partners_id gives the id of each person's partner. Using partners_id, id, and individual_income, how do I copy each person's income to his/her partner's observation?

This was my attempt below:

summarize id, meanonly
capture gen partners_income = 0

forvalue ln = 1/`r(max)' {
    bys household (id): ///
    egen link_`ln' = total(individual_income) if partners_location==`ln')
    replace partners_income = link_`ln' if link_`ln' > 0 & id == `ln'
    drop link_*
} 


Answers (1)

Nick Cox


There is general advice in this FAQ.

It can take longer to write a smart way to do this than to use a quick-and-dirty approach.

However, there is a smarter way.

Brute solution

Quick here means relatively quick to code; this isn't guaranteed quick for a very large dataset.

gen partners_income = . 
gen problem = 0 

The proper initialisation of the partner's income variable is to missing, not zero. Not knowing an income and the income being zero are different conditions. For example, if someone doesn't have a partner, the income will certainly be missing. (If at a later stage, you want to treat missings as zeros, that's up to you, but you should keep them distinct at this stage.)
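As a minimal sketch of that later stage, assuming partners_income has already been filled in: the toy values and the names couple_strict and couple_zero below are hypothetical, chosen only to show how the two treatments of missing differ when the couple's incomes are added.

```stata
clear
* Hypothetical toy data: person 3 has no partner, so partners_income is missing
input id individual_income partners_income
1 100 250
2 250 100
3  80   .
end

* Plain addition keeps the distinction: missing if either component is missing
gen couple_strict = individual_income + partners_income

* egen's rowtotal() treats missing as zero, so person 3 gets 80
egen couple_zero = rowtotal(individual_income partners_income)

list
```

Which of the two you want depends on whether a person without a partner should count as a "couple" of one.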

The reason for the problem variable will become apparent.

I can't see a reason for your capture.

Now we can loop:

quietly forval i = 1/`=_N' { 
      su individual_income if id == partners_id[`i'], meanonly 
      replace partners_income = r(max) in `i' 
      if r(N) > 1 replace problem = r(N) in `i' 
}

So, the logic is

For each observation:

  1. find the partner's identifier
  2. find that income: summarize, meanonly is fast
  3. that should be one value, so it should be immaterial whether we pick it up from the results of summarize as the maximum, minimum, or mean
  4. but if summarize finds more than one value, something is not as assumed (mistakes over identifiers, or multiple partners); later we edit if problem and look at those observations.

Notes:

We can make comparison safer by restricting computations to the same household by modifying

if id == partners_id[`i']

to

if id == partners_id[`i'] & household == household[`i'] 
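To see why the household restriction can matter, here is a small sketch of the loop run on made-up data in which identifiers repeat across households. The data values are hypothetical; the variable names follow the question.

```stata
clear
* Hypothetical toy data: ids 1 and 2 occur in both households
input household id partners_id individual_income
1 1 2 100
1 2 1 250
1 3 .  80
2 1 2  40
2 2 1  60
end

gen partners_income = .
gen problem = 0

quietly forval i = 1/`=_N' {
    * look for the partner only within the same household
    su individual_income if id == partners_id[`i'] & household == household[`i'], meanonly
    * r(max) is missing when no observation matched, so nothing is invented
    replace partners_income = r(max) in `i'
    * flag cases where more than one observation matched
    if r(N) > 1 replace problem = r(N) in `i'
}

list, sepby(household)
```

Without the household condition, each lookup here would match two observations (one per household), and problem would be set to 2 for everyone with a partner.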

In one place you have the variable partners_location which looks like a typo for partners_id.

Cute solution

Assuming that partners name each other as partner (and this is not the forum to explore exceptions), then couples have a joint identity which we obtain by sorting "John Joanna" and "Joanna John" to "Joanna John" or the equivalent with numeric identifiers:

gen first = cond(id < partners_id, id, partners_id) 
gen second = cond(id < partners_id, partners_id, id) 
egen joint = concat(first second), p(" ") 

Here first and second just mean first and second in numeric or alphanumeric order; this works for numeric and string identifiers. You may need to slap on an exclusion clause such as

if !missing(partners_id) 

Now

bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2 

Get it? Each distinct combination of household and joint should contain precisely 2 observations for us to be interested (hence the qualifier if _N == 2). If that's true, then 3 - _n gives us the subscript of the other partner: if _n is 1 then 3 - _n is 2, and vice versa. Under by:, subscripts are always applied within groups, so that _n runs 1, 2, and so forth in each distinct group.
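A worked sketch of the whole approach on made-up data may help. The values are hypothetical, the variable names follow the question (partners_id, individual_income), and the exclusion clause is attached to the final statement, which is one possible place for it.

```stata
clear
* Hypothetical toy data: person 3 in household 1 has no partner
input household id partners_id individual_income
1 1 2 100
1 2 1 250
1 3 .  80
2 1 2  40
2 2 1  60
end

* Put each pair of identifiers in a fixed order so both partners share one key
gen first = cond(id < partners_id, id, partners_id)
gen second = cond(id < partners_id, partners_id, id)
egen joint = concat(first second), p(" ")

* Within each household-couple group of exactly 2, take the other observation's income
bysort household joint : gen partners_income = individual_income[3 - _n] if _N == 2 & !missing(partners_id)

list household id partners_id joint individual_income partners_income, sepby(household)
```

Person 3 ends up alone in a group, so the _N == 2 qualifier leaves partners_income missing there.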

If this seems cryptic, it is all spelled out in Cox, N.J. 2008. The problem of split identity, or how to group dyads. Stata Journal 8(4): 588-591 which is accessible as a .pdf.

