Reputation: 603
All,
I have what may or may not be a complicated question of merging two different types of data in R. I'm working with a directed dyad-year data frame (A vs. B, B vs. A). I want to read in, or merge, data from a country-year data set in the following manner.
Assume variable x
from the country-year data set (CY
) is the variable of interest I'm trying to merge into the directed dyad-year data set (DDY
). In a simplified version of just four cross-sectional units (A, B, C, D) over three years (1990-1992), it looks like this.
country year x
A 1990 6.2352
A 1991 7.2342
A 1992 8.3902
B 1990 2.2342
B 1991 5.1292
B 1992 1.0001
C 1990 4.1202
C 1991 9.1202
C 1992 1.2011
D 1990 1.2910
D 1991 5.0001
D 1992 2.1111
I'm working on the directed dyad-year data set (DDY
), which already has many other variables of interest. Basically, I want to take x
from CY
and create x1
and x2
in DDY
, matching x1
in the directed dyad-year data set with the corresponding value of x
in a given year from the country-year data and doing the same for x2
from variable x
in the country-year data.
In short, I want DDY
to look like this.
country1 country2 year x1 x2
A B 1990 6.2352 2.2342
A B 1991 7.2342 5.1292
A B 1992 8.3902 1.0001
A C 1990 6.2352 4.1202
A C 1991 7.2342 9.1202
A C 1992 8.3902 1.2011
A D 1990 6.2352 1.2910
A D 1991 7.2342 5.0001
A D 1992 8.3902 2.1111
B A 1990 2.2342 6.2352
B A 1991 5.1292 7.2342
B A 1992 1.0001 8.3902
...
The data go on from there for each directed dyad-year pairing.
What I don't know is if this is a delicate process using the merge
command or if some other route is most appropriate. Any input would be appreciated and I'll offer any clarification about the data with which I'm working, should it be helpful in finding a solution.
This previously asked question is clearly relevant. However, because no reproducible code was provided in asking the question, the answer seems a bit obtuse for what I want to do. If that solution is the route to go, it may be helpful to clarify what it's doing.
Thanks.
Reproducible code follows.
country <- c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D")
year <- c(1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992)
x <- c(6.2352, 7.2342, 8.3902, 2.2342, 5.1292, 1.0001, 4.1202, 9.1202, 1.2011, 1.2910, 5.0001, 2.1111)
CY <- data.frame(country=country, year=year, x=x)
CY
country1 <- c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D")
country2 <- c("B", "B", "B", "C", "C", "C", "D", "D", "D", "A", "A", "A", "C", "C", "C", "D", "D", "D", "A", "A", "A", "B", "B", "B", "D", "D", "D", "A", "A", "A", "B", "B", "B", "C", "C", "C")
year <- c(1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992)
DDY <- data.frame(country1=country1, country2=country2, year=year)
DDY
Upvotes: 2
Views: 1180
Reputation: 270170
1. Just CY This can be done using only CY
like this:
library(sqldf)
sqldf("select A.country country1, B.country country2, year, A.x x1, B.x x2
from CY A join CY B using (year)
where A.country != B.country
order by A.country, B.country")
which gives:
country1 country2 year x1 x2
1 A B 1990 6.2352 2.2342
2 A B 1991 7.2342 5.1292
3 A B 1992 8.3902 1.0001
4 A C 1990 6.2352 4.1202
5 A C 1991 7.2342 9.1202
6 A C 1992 8.3902 1.2011
7 A D 1990 6.2352 1.2910
8 A D 1991 7.2342 5.0001
9 A D 1992 8.3902 2.1111
10 B A 1990 2.2342 6.2352
11 B A 1991 5.1292 7.2342
12 B A 1992 1.0001 8.3902
13 B C 1990 2.2342 4.1202
14 B C 1991 5.1292 9.1202
15 B C 1992 1.0001 1.2011
16 B D 1990 2.2342 1.2910
17 B D 1991 5.1292 5.0001
18 B D 1992 1.0001 2.1111
19 C A 1990 4.1202 6.2352
20 C A 1991 9.1202 7.2342
21 C A 1992 1.2011 8.3902
22 C B 1990 4.1202 2.2342
23 C B 1991 9.1202 5.1292
24 C B 1992 1.2011 1.0001
25 C D 1990 4.1202 1.2910
26 C D 1991 9.1202 5.0001
27 C D 1992 1.2011 2.1111
28 D A 1990 1.2910 6.2352
29 D A 1991 5.0001 7.2342
30 D A 1992 2.1111 8.3902
31 D B 1990 1.2910 2.2342
32 D B 1991 5.0001 5.1292
33 D B 1992 2.1111 1.0001
34 D C 1990 1.2910 4.1202
35 D C 1991 5.0001 9.1202
36 D C 1992 2.1111 1.2011
2. CY and DDY
Or, to merge CY
with DDY
try this:
sqldf("select A.country country1, B.country country2, A.year, A.x x1, B.x x2
from DDY join CY A join CY B
on DDY.country1 = A.country and DDY.year = A.year
and DDY.country2 = B.country and DDY.year = B.year
order by A.country, B.country")
which gives this:
country1 country2 year x1 x2
1 A B 1990 6.2352 2.2342
2 A B 1991 7.2342 5.1292
3 A B 1992 8.3902 1.0001
4 A C 1990 6.2352 4.1202
5 A C 1991 7.2342 9.1202
6 A C 1992 8.3902 1.2011
7 A D 1990 6.2352 1.2910
8 A D 1991 7.2342 5.0001
9 A D 1992 8.3902 2.1111
10 B A 1990 2.2342 6.2352
11 B A 1991 5.1292 7.2342
12 B A 1992 1.0001 8.3902
13 B C 1990 2.2342 4.1202
14 B C 1991 5.1292 9.1202
15 B C 1992 1.0001 1.2011
16 B D 1990 2.2342 1.2910
17 B D 1991 5.1292 5.0001
18 B D 1992 1.0001 2.1111
19 C A 1990 4.1202 6.2352
20 C A 1991 9.1202 7.2342
21 C A 1992 1.2011 8.3902
22 C B 1990 4.1202 2.2342
23 C B 1991 9.1202 5.1292
24 C B 1992 1.2011 1.0001
25 C D 1990 4.1202 1.2910
26 C D 1991 9.1202 5.0001
27 C D 1992 1.2011 2.1111
28 D A 1990 1.2910 6.2352
29 D A 1991 5.0001 7.2342
30 D A 1992 2.1111 8.3902
31 D B 1990 1.2910 2.2342
32 D B 1991 5.0001 5.1292
33 D B 1992 2.1111 1.0001
34 D C 1990 1.2910 4.1202
35 D C 1991 5.0001 9.1202
36 D C 1992 2.1111 1.2011
UPDATE: Added solution using both CY
and DDY
.
Upvotes: 1
Reputation: 2384
Here is an alternative to create DDY from CY without resorting to SQL synthax.
ind <- expand.grid(1:nrow(CY), 1:nrow(CY))
CY.1 <- CY[ind[, 1], ]
CY.2 <- CY[ind[, 2], ]
bool <- (CY.1$year == CY.2$year) & (CY.1$country != CY.2$country)
DDY <- data.frame(country1 = CY.1$country[bool],
country2 = CY.2$country[bool],
year = CY.1$year[bool],
x1 = CY.1$x[bool],
x2 = CY.2$x[bool])
DDY <- DDY[order(country1, country2), ]
DDY
Upvotes: 2