Merging on pairs of values in one column of each data frame

Question

I am trying to merge two data frames with columns of different lengths and rows.To give the exact idea DF1 is:

     ID     year    freq1   mun    
       1    2005    2     61137
       1    2006    1     61383
       2    2005    3     14520
       2    2006    2     14604
       4    2005    3     101423
       4    2006    1     102257
       6    2005    0     39039
       6    2006    1     39346

Whereas DF2 is:

      ID        year    freq2   mun
       1        2004    5     60857
       1        2005    3     61137
       2        2004    4     14278
       2        2005    4     14520
       3        2004    2     22563
       3        2005    0     22635
       4        2004    6     101015
       4        2005    4     101423
       5        2004    6     61152
       5        2005    3     61932
       6        2004    4     38456
       6        2005    3     39039

As you can see both year and mun variables are somewhat different and have only one common entry. So what I'm trying to achieve is to merge freq1 and freq2 columns with respect to ID's. However the trick is that DF1 should take priority (left merge?) in such way that year and mun variables are the ones chosen from DF1. Desired output:

      ID    year    freq1   mun    freq2
       1    2005    2     61137    5
       1    2006    1     61383    3
       2    2005    3     14520    4
       2    2006    2     14604    4
       4    2005    3     101423   6
       4    2006    1     102257   4
       6    2005    0     39039    4
       6    2006    1     39346    3

As well as other way around for DF2 taking priority in such way that:

      ID        year    freq2   mun   freq1
       1        2004    5     60857   2
       1        2005    3     61137   1
       2        2004    4     14278   3
       2        2005    4     14520   2
       3        2004    2     22563   0
       3        2005    0     22635   0
       4        2004    6     101015  3
       4        2005    4     101423  1
       5        2004    6     61152   0
       5        2005    3     61932   0
       6        2004    4     38456   0
       6        2005    3     39039   1

I've tried deleting year and mun columns and merge freq1 and freq2 according to common ID's however it only provides me with multiple duplicate entries. Any suggestions?

Matthew Lundberg · Accepted Answer

It appears that you are trying to match pairs of IDs in the data frames, in the order presented.

Matching on the ID column alone will cause a cross-product to be formed, giving four rows for ID == 1, which is what I assume you mean by "multiple duplicate entries."

To merge the pairs of ID values, you need to disambiguate the individual values, so the merge merges the first ID value in df1 with the first ID value in df2, and similarly for the second ID values.

This disambiguation can be done by adding another column, which adds a counter for the number of ID values seen. seq_along counts, and ave applies to the "levels" of ID:

df1$ID2 <- ave(df1$ID, df1$ID, FUN=seq_along)
df2$ID2 <- ave(df2$ID, df2$ID, FUN=seq_along)

Here's the new df1. df2 is similarly modified.

> df1
  ID year freq1    mun ID2
1  1 2005     2  61137   1
2  1 2006     1  61383   2
3  2 2005     3  14520   1
4  2 2006     2  14604   2
5  4 2005     3 101423   1
6  4 2006     1 102257   2
7  6 2005     0  39039   1
8  6 2006     1  39346   2

These are now appropriate for passing to merge to get the two sides that you want. Removing the unused column from each side prevents the merge from taking data that you don't want:

> merge(df1, df2[-c(2,4)], by=c('ID', 'ID2'), all.x=T)[-2]
  ID year freq1    mun freq2
1  1 2005     2  61137     5
2  1 2006     1  61383     3
3  2 2005     3  14520     4
4  2 2006     2  14604     4
5  4 2005     3 101423     6
6  4 2006     1 102257     4
7  6 2005     0  39039     4
8  6 2006     1  39346     3
> merge(df1[-c(2,4)], df2, by=c('ID', 'ID2'), all.y=T)[-2]
   ID freq1 year freq2    mun
1   1     2 2004     5  60857
2   1     1 2005     3  61137
3   2     3 2004     4  14278
4   2     2 2005     4  14520
5   3    NA 2004     2  22563
6   3    NA 2005     0  22635
7   4     3 2004     6 101015
8   4     1 2005     4 101423
9   5    NA 2004     6  61152
10  5    NA 2005     3  61932
11  6     0 2004     4  38456
12  6     1 2005     3  39039

Note that NA values are used where there is no match. You can replace these with 0 values if that is really appropriate.

The [-2] at the end removes the added column ID2.

This is a fairly unusual way to merge. It depends on the order of the data in addition the values, so it does seem to be fragile. But I do think that I've captured what you want to accomplish.

Merging on pairs of values in one column of each data frame

Answers (2)

Related Questions