steve
steve

Reputation: 603

Merging country-year data into directed dyad-year data in R

All,

I have what may or may not be a complicated question of merging two different types of data in R. I'm working with a directed dyad-year data frame (A vs. B, B vs. A). I want to read in, or merge, data from a country-year data set in the following manner.

Assume variable x from the country-year data set (CY) is the variable of interest I'm trying to merge into the directed dyad-year data set (DDY). In a simplified version of just four cross-sectional units (A, B, C, D) over three years (1990-1992), it looks like this.

country     year      x
  A         1990    6.2352
  A         1991    7.2342
  A         1992    8.3902
  B         1990    2.2342
  B         1991    5.1292
  B         1992    1.0001
  C         1990    4.1202
  C         1991    9.1202
  C         1992    1.2011
  D         1990    1.2910
  D         1991    5.0001
  D         1992    2.1111

I'm working on the directed dyad-year data set (DDY), which already has many other variables of interest. Basically, I want to take x from CY and create x1 and x2 in DDY, matching x1 in the directed dyad-year data set with the corresponding value of x in a given year from the country-year data and doing the same for x2 from variable x in the country-year data.

In short, I want DDY to look like this.

country1     country2     year     x1          x2
   A           B          1990    6.2352     2.2342
   A           B          1991    7.2342     5.1292
   A           B          1992    8.3902     1.0001
   A           C          1990    6.2352     4.1202
   A           C          1991    7.2342     9.1202
   A           C          1992    8.3902     1.2011
   A           D          1990    6.2352     1.2910
   A           D          1991    7.2342     5.0001
   A           D          1992    8.3902     2.1111
   B           A          1990    2.2342     6.2352
   B           A          1991    5.1292     7.2342
   B           A          1992    1.0001     8.3902
   ...

The data go on from there for each directed dyad-year pairing.

What I don't know is if this is a delicate process using the merge command or if some other route is most appropriate. Any input would be appreciated and I'll offer any clarification about the data with which I'm working, should it be helpful in finding a solution.

This previously asked question is clearly relevant. However, because no reproducible code was provided in asking the question, the answer seems a bit obtuse for what I want to do. If that solution is the route to go, it may be helpful to clarify what it's doing.

Thanks.

Reproducible code follows.

country <- c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D")
year <- c(1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992)
x <- c(6.2352, 7.2342, 8.3902, 2.2342, 5.1292, 1.0001, 4.1202, 9.1202, 1.2011, 1.2910, 5.0001, 2.1111)
CY <- data.frame(country=country, year=year, x=x)
CY

country1 <- c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D", "D")
country2 <- c("B", "B", "B", "C", "C", "C", "D", "D", "D", "A", "A", "A", "C", "C", "C", "D", "D", "D", "A", "A", "A", "B", "B", "B", "D", "D", "D", "A", "A", "A", "B", "B", "B", "C", "C", "C")
year <- c(1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992)
DDY <- data.frame(country1=country1, country2=country2, year=year)
DDY

Upvotes: 2

Views: 1180

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270170

1. Just CY This can be done using only CY like this:

library(sqldf)

sqldf("select A.country country1, B.country country2, year, A.x x1, B.x x2 
   from CY A join CY B using (year) 
   where A.country != B.country 
   order by A.country, B.country")

which gives:

   country1 country2 year     x1     x2
1         A        B 1990 6.2352 2.2342
2         A        B 1991 7.2342 5.1292
3         A        B 1992 8.3902 1.0001
4         A        C 1990 6.2352 4.1202
5         A        C 1991 7.2342 9.1202
6         A        C 1992 8.3902 1.2011
7         A        D 1990 6.2352 1.2910
8         A        D 1991 7.2342 5.0001
9         A        D 1992 8.3902 2.1111
10        B        A 1990 2.2342 6.2352
11        B        A 1991 5.1292 7.2342
12        B        A 1992 1.0001 8.3902
13        B        C 1990 2.2342 4.1202
14        B        C 1991 5.1292 9.1202
15        B        C 1992 1.0001 1.2011
16        B        D 1990 2.2342 1.2910
17        B        D 1991 5.1292 5.0001
18        B        D 1992 1.0001 2.1111
19        C        A 1990 4.1202 6.2352
20        C        A 1991 9.1202 7.2342
21        C        A 1992 1.2011 8.3902
22        C        B 1990 4.1202 2.2342
23        C        B 1991 9.1202 5.1292
24        C        B 1992 1.2011 1.0001
25        C        D 1990 4.1202 1.2910
26        C        D 1991 9.1202 5.0001
27        C        D 1992 1.2011 2.1111
28        D        A 1990 1.2910 6.2352
29        D        A 1991 5.0001 7.2342
30        D        A 1992 2.1111 8.3902
31        D        B 1990 1.2910 2.2342
32        D        B 1991 5.0001 5.1292
33        D        B 1992 2.1111 1.0001
34        D        C 1990 1.2910 4.1202
35        D        C 1991 5.0001 9.1202
36        D        C 1992 2.1111 1.2011

2. CY and DDY

Or, to merge CY with DDY try this:

sqldf("select A.country country1, B.country country2, A.year, A.x x1, B.x x2 
   from DDY join CY A join CY B 
   on DDY.country1 = A.country and DDY.year = A.year 
   and DDY.country2 = B.country and DDY.year = B.year
   order by A.country, B.country")

which gives this:

   country1 country2 year     x1     x2
1         A        B 1990 6.2352 2.2342
2         A        B 1991 7.2342 5.1292
3         A        B 1992 8.3902 1.0001
4         A        C 1990 6.2352 4.1202
5         A        C 1991 7.2342 9.1202
6         A        C 1992 8.3902 1.2011
7         A        D 1990 6.2352 1.2910
8         A        D 1991 7.2342 5.0001
9         A        D 1992 8.3902 2.1111
10        B        A 1990 2.2342 6.2352
11        B        A 1991 5.1292 7.2342
12        B        A 1992 1.0001 8.3902
13        B        C 1990 2.2342 4.1202
14        B        C 1991 5.1292 9.1202
15        B        C 1992 1.0001 1.2011
16        B        D 1990 2.2342 1.2910
17        B        D 1991 5.1292 5.0001
18        B        D 1992 1.0001 2.1111
19        C        A 1990 4.1202 6.2352
20        C        A 1991 9.1202 7.2342
21        C        A 1992 1.2011 8.3902
22        C        B 1990 4.1202 2.2342
23        C        B 1991 9.1202 5.1292
24        C        B 1992 1.2011 1.0001
25        C        D 1990 4.1202 1.2910
26        C        D 1991 9.1202 5.0001
27        C        D 1992 1.2011 2.1111
28        D        A 1990 1.2910 6.2352
29        D        A 1991 5.0001 7.2342
30        D        A 1992 2.1111 8.3902
31        D        B 1990 1.2910 2.2342
32        D        B 1991 5.0001 5.1292
33        D        B 1992 2.1111 1.0001
34        D        C 1990 1.2910 4.1202
35        D        C 1991 5.0001 9.1202
36        D        C 1992 2.1111 1.2011

UPDATE: Added solution using both CY and DDY.

Upvotes: 1

QuantIbex
QuantIbex

Reputation: 2384

Here is an alternative to create DDY from CY without resorting to SQL synthax.

ind  <- expand.grid(1:nrow(CY), 1:nrow(CY))
CY.1 <- CY[ind[, 1], ]
CY.2 <- CY[ind[, 2], ]
bool <- (CY.1$year == CY.2$year) & (CY.1$country != CY.2$country)
DDY  <- data.frame(country1 = CY.1$country[bool], 
                   country2 = CY.2$country[bool],
                   year     = CY.1$year[bool],
                   x1       = CY.1$x[bool],
                   x2       = CY.2$x[bool])
DDY  <- DDY[order(country1, country2), ]
DDY

Upvotes: 2

Related Questions