I have a dataset that looks like this: country1 country2 group China Philippines 68a China Thailand 68a Bahamas Jamaica 176a Bahamas Grenada 176a I need to transform the above dataset into like this: country1 country2 group China Philippines 68a China Thailand 68a Philippines China 68a Philippines Thailand 68a Thailand China 68a Thailand Philippines 68a Bahamas Jamaica 176a Bahamas Grenada 176a Jamaica Bahamas 176a Jamaica Grenada 176a Grenada Bahamas 176a Grenada Jamaica 176a I tried my best to follow the Stata code in this article: https://www.stata.com/support/faqs/data-management/expanding-datasets-to-all-pairs/ . However, I ended up with a dataset that looks like this: country1 country2 group China Philippines 68a China Philippines 68a China Thailand 68a China Thailand 68a Bahamas Jamaica 176a Bahamas Jamaica 176a Bahamas Grenada 176a Bahamas Grenada 176a I'm not sure what I'm doing wrong.

Reputation: 33

create dataset based on all possible pairs of identifiers within each group in Stata

I have a dataset that looks like this:

country1	country2	group
China	Philippines	68a
China	Thailand	68a
Bahamas	Jamaica	176a
Bahamas	Grenada	176a

I need to transform the above dataset into like this:

country1	country2	group
China	Philippines	68a
China	Thailand	68a
Philippines	China	68a
Philippines	Thailand	68a
Thailand	China	68a
Thailand	Philippines	68a
Bahamas	Jamaica	176a
Bahamas	Grenada	176a
Jamaica	Bahamas	176a
Jamaica	Grenada	176a
Grenada	Bahamas	176a
Grenada	Jamaica	176a

I tried my best to follow the Stata code in this article: https://www.stata.com/support/faqs/data-management/expanding-datasets-to-all-pairs/. However, I ended up with a dataset that looks like this:

country1	country2	group
China	Philippines	68a
China	Philippines	68a
China	Thailand	68a
China	Thailand	68a
Bahamas	Jamaica	176a
Bahamas	Jamaica	176a
Bahamas	Grenada	176a
Bahamas	Grenada	176a

I'm not sure what I'm doing wrong.

Upvotes: 1

Answers (2)

Dotty Ramizo

Reputation: 33

I was able to generate my desired dataset with my codes below but I'm hoping to find a more straightforward way to code it.

use "original dataset", clear
drop country2
save "temp_country1", replace

use "original dataset", clear
drop country1
ren country2 country1 
append using "temp_country1"

//drop duplicates//
sort number country1
quietly by number country1:  gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
save "temp_country1_final", replace

use "temp_country1_final", clear
ren country1 country2
save "temp_country2.dta", replace

use "temp_country1_final", clear
joinby number using "temp_country2.dta"
order country1 country2  number name
drop if country1==country2

Upvotes: 0

JR96

Reputation: 973

I think the issue is because your unique identifier is really the combination of two columns (country1 and country2) whereas in the example you are following there is a unique id column. Here is how I would do this with your example, provided your dataset isn't incredibly large:

clear
input str40(country1 country2 group)
"China" "Philippines"   "68a"
"China" "Thailand"  "68a"
"Bahamas"   "Jamaica"   "176a"
"Bahamas"   "Grenada"   "176a"
end

egen pair_id = group(country1 country2) // Create unique pair id
reshape long country, i(group pair_id) j(j ) // reshape all countries long
drop pair_id j
rename country country1

* create duplicate dataset to fulljoin
preserve
    rename country country2
    keep country2 group
    tempfile cross
    save `cross', replace
restore

joinby group using `cross' // full join
drop if country1 == country2

* Some tidying to match example output
order country1 country2 group
gsort -group country1 country2
duplicates drop

Upvotes: 2

create dataset based on all possible pairs of identifiers within each group in Stata

Answers (2)

Related Questions