Reputation: 1
I am trying to merge 1:1
two datasets, which correspond to surveys of the same individuals at different periods; no unique identifier was constructed.
What was gathered was email addresses, although those are stored in three different variables in each dataset. People did not necessarily write the same address in each.
Does anyone know how or if such thing can be done?
clear
set more off
input str16 mai1 str16 email2 str16 email3 str16 data1
"[email protected]" "[email protected]" "[email protected]" "aaaa1"
"[email protected]" "" "[email protected]" "bbbb1"
end
tempfile first
save "`first'"
clear
input str16 mai1 str16 email2 str16 email3 str16 data1
"" "[email protected]" "" "" "bbbb2"
"[email protected]" "" "" "aaaa2"
end
Upvotes: 0
Views: 438
Reputation: 37208
A way to get started is to hold an election among the three addresses for each individual in each dataset:
If you sort on the addresses for each individual, then after sorting
If all email addresses agree, the 2nd will be the same as all.
If two email addresses agree, the 2nd will be that agreed address, because those two will sort last or first, depending on the alphabet, but either way one of the two will be second. (The principle is like the median of 3 numbers: with cases either like 1, 1, 2 or like 1, 2, 2, if any 2 numbers agree out of 3, that number is the median; so also with strings sorted alphanumerically.)
If three email addresses disagree, the 2nd will be no worse as a guess than any other.
If one email address is blank, and the others disagree, the same applies.
If two email addresses are blank, use the one that was given.
If all are blank, goodness knows.
If you do this in both datasets, and merge 1:1
on the results, then look at what doesn't match after the merge
and consider your options. There is a choice ranging from abandoning what doesn't match to trying something else.
clear
input str16 mai1 str16 email2 str16 email3 str16 data1
"[email protected]" "[email protected]" "[email protected]" "aaaa1"
"[email protected]" "" "[email protected]" "bbbb1"
end
gen long id = _n
rename mai1 email1
reshape long email , i(id) j(which)
bysort id (email) : gen EMAIL = email[2]
replace EMAIL = email[3] if EMAIL == ""
reshape wide email, i(id) j(which)
list
+----------------------------------------------------------------------------+
| id email1 email2 email3 data1 EMAIL |
|----------------------------------------------------------------------------|
1. | 1 [email protected] [email protected] [email protected] aaaa1 [email protected] |
2. | 2 [email protected] [email protected] bbbb1 [email protected] |
+----------------------------------------------------------------------------+
.
You can extend this technique to find second and at worst third guesses in each case.
This technique will catch 1 of 2 cases in your data example, but careful study will find the second match.
Upvotes: 2