Dotty Ramizo
Dotty Ramizo

Reputation: 33

create dataset based on all possible pairs of identifiers within each group in Stata

I have a dataset that looks like this:

country1 country2 group
China Philippines 68a
China Thailand 68a
Bahamas Jamaica 176a
Bahamas Grenada 176a

I need to transform the above dataset into like this:

country1 country2 group
China Philippines 68a
China Thailand 68a
Philippines China 68a
Philippines Thailand 68a
Thailand China 68a
Thailand Philippines 68a
Bahamas Jamaica 176a
Bahamas Grenada 176a
Jamaica Bahamas 176a
Jamaica Grenada 176a
Grenada Bahamas 176a
Grenada Jamaica 176a

I tried my best to follow the Stata code in this article: https://www.stata.com/support/faqs/data-management/expanding-datasets-to-all-pairs/. However, I ended up with a dataset that looks like this:

country1 country2 group
China Philippines 68a
China Philippines 68a
China Thailand 68a
China Thailand 68a
Bahamas Jamaica 176a
Bahamas Jamaica 176a
Bahamas Grenada 176a
Bahamas Grenada 176a

I'm not sure what I'm doing wrong.

Upvotes: 1

Views: 611

Answers (2)

Dotty Ramizo
Dotty Ramizo

Reputation: 33

I was able to generate my desired dataset with my codes below but I'm hoping to find a more straightforward way to code it.

use "original dataset", clear
drop country2
save "temp_country1", replace

use "original dataset", clear
drop country1
ren country2 country1 
append using "temp_country1"

//drop duplicates//
sort number country1
quietly by number country1:  gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
save "temp_country1_final", replace

use "temp_country1_final", clear
ren country1 country2
save "temp_country2.dta", replace

use "temp_country1_final", clear
joinby number using "temp_country2.dta"
order country1 country2  number name
drop if country1==country2

Upvotes: 0

JR96
JR96

Reputation: 973

I think the issue is because your unique identifier is really the combination of two columns (country1 and country2) whereas in the example you are following there is a unique id column. Here is how I would do this with your example, provided your dataset isn't incredibly large:

clear
input str40(country1 country2 group)
"China" "Philippines"   "68a"
"China" "Thailand"  "68a"
"Bahamas"   "Jamaica"   "176a"
"Bahamas"   "Grenada"   "176a"
end

egen pair_id = group(country1 country2) // Create unique pair id
reshape long country, i(group pair_id) j(j ) // reshape all countries long
drop pair_id j
rename country country1

* create duplicate dataset to fulljoin
preserve
    rename country country2
    keep country2 group
    tempfile cross
    save `cross', replace
restore

joinby group using `cross' // full join
drop if country1 == country2

* Some tidying to match example output
order country1 country2 group
gsort -group country1 country2
duplicates drop

Upvotes: 2

Related Questions