korbras
korbras

Reputation: 1

Is there a way of merging in Stata when the keys might be mixed in different variables?

I am trying to merge 1:1 two datasets, which correspond to surveys of the same individuals at different periods; no unique identifier was constructed.

What was gathered was email addresses, although those are stored in three different variables in each dataset. People did not necessarily write the same address in each.

Does anyone know how or if such thing can be done?

clear
set more off

input str16 mai1 str16 email2   str16 email3         str16 data1
"[email protected]"  "[email protected]"  "[email protected]" "aaaa1"
"[email protected]"   ""              "[email protected]"        "bbbb1"
end

tempfile first
save "`first'"

clear

input str16 mai1    str16 email2    str16 email3   str16 data1
""                  "[email protected]"   ""      ""     "bbbb2"
"[email protected]"   ""              ""             "aaaa2"
end

Upvotes: 0

Views: 438

Answers (1)

Nick Cox
Nick Cox

Reputation: 37208

A way to get started is to hold an election among the three addresses for each individual in each dataset:

If you sort on the addresses for each individual, then after sorting

  1. If all email addresses agree, the 2nd will be the same as all.

  2. If two email addresses agree, the 2nd will be that agreed address, because those two will sort last or first, depending on the alphabet, but either way one of the two will be second. (The principle is like the median of 3 numbers: with cases either like 1, 1, 2 or like 1, 2, 2, if any 2 numbers agree out of 3, that number is the median; so also with strings sorted alphanumerically.)

  3. If three email addresses disagree, the 2nd will be no worse as a guess than any other.

  4. If one email address is blank, and the others disagree, the same applies.

  5. If two email addresses are blank, use the one that was given.

  6. If all are blank, goodness knows.

If you do this in both datasets, and merge 1:1 on the results, then look at what doesn't match after the merge and consider your options. There is a choice ranging from abandoning what doesn't match to trying something else.

clear 
input str16 mai1 str16 email2   str16 email3         str16 data1
"[email protected]"  "[email protected]"  "[email protected]" "aaaa1"
"[email protected]"   ""              "[email protected]"        "bbbb1"
end

gen long id = _n 

rename mai1 email1 
reshape long email , i(id) j(which)
bysort id (email) : gen EMAIL = email[2]
replace EMAIL = email[3] if EMAIL == "" 

reshape wide email, i(id) j(which)

list 

     +----------------------------------------------------------------------------+
     | id         email1         email2             email3   data1          EMAIL |
     |----------------------------------------------------------------------------|
  1. |  1   [email protected]   [email protected]   [email protected]   aaaa1   [email protected] |
  2. |  2    [email protected]                       [email protected]   bbbb1    [email protected] |
     +----------------------------------------------------------------------------+

.

You can extend this technique to find second and at worst third guesses in each case.

This technique will catch 1 of 2 cases in your data example, but careful study will find the second match.

Upvotes: 2

Related Questions