Reputation: 77
I am trying to merge two data frames with columns of different lengths and rows.To give the exact idea DF1 is:
ID year freq1 mun
1 2005 2 61137
1 2006 1 61383
2 2005 3 14520
2 2006 2 14604
4 2005 3 101423
4 2006 1 102257
6 2005 0 39039
6 2006 1 39346
Whereas DF2 is:
ID year freq2 mun
1 2004 5 60857
1 2005 3 61137
2 2004 4 14278
2 2005 4 14520
3 2004 2 22563
3 2005 0 22635
4 2004 6 101015
4 2005 4 101423
5 2004 6 61152
5 2005 3 61932
6 2004 4 38456
6 2005 3 39039
As you can see both year and mun variables are somewhat different and have only one common entry. So what I'm trying to achieve is to merge freq1 and freq2 columns with respect to ID's. However the trick is that DF1 should take priority (left merge?) in such way that year and mun variables are the ones chosen from DF1. Desired output:
ID year freq1 mun freq2
1 2005 2 61137 5
1 2006 1 61383 3
2 2005 3 14520 4
2 2006 2 14604 4
4 2005 3 101423 6
4 2006 1 102257 4
6 2005 0 39039 4
6 2006 1 39346 3
As well as other way around for DF2 taking priority in such way that:
ID year freq2 mun freq1
1 2004 5 60857 2
1 2005 3 61137 1
2 2004 4 14278 3
2 2005 4 14520 2
3 2004 2 22563 0
3 2005 0 22635 0
4 2004 6 101015 3
4 2005 4 101423 1
5 2004 6 61152 0
5 2005 3 61932 0
6 2004 4 38456 0
6 2005 3 39039 1
I've tried deleting year and mun columns and merge freq1 and freq2 according to common ID's however it only provides me with multiple duplicate entries. Any suggestions?
Upvotes: 1
Views: 1022
Reputation: 42629
It appears that you are trying to match pairs of ID
s in the data frames, in the order presented.
Matching on the ID
column alone will cause a cross-product to be formed, giving four rows for ID == 1
, which is what I assume you mean by "multiple duplicate entries."
To merge the pairs of ID
values, you need to disambiguate the individual values, so the merge
merges the first ID
value in df1
with the first ID
value in df2
, and similarly for the second ID
values.
This disambiguation can be done by adding another column, which adds a counter for the number of ID
values seen. seq_along
counts, and ave
applies to the "levels" of ID
:
df1$ID2 <- ave(df1$ID, df1$ID, FUN=seq_along)
df2$ID2 <- ave(df2$ID, df2$ID, FUN=seq_along)
Here's the new df1
. df2
is similarly modified.
> df1
ID year freq1 mun ID2
1 1 2005 2 61137 1
2 1 2006 1 61383 2
3 2 2005 3 14520 1
4 2 2006 2 14604 2
5 4 2005 3 101423 1
6 4 2006 1 102257 2
7 6 2005 0 39039 1
8 6 2006 1 39346 2
These are now appropriate for passing to merge
to get the two sides that you want. Removing the unused column from each side prevents the merge from taking data that you don't want:
> merge(df1, df2[-c(2,4)], by=c('ID', 'ID2'), all.x=T)[-2]
ID year freq1 mun freq2
1 1 2005 2 61137 5
2 1 2006 1 61383 3
3 2 2005 3 14520 4
4 2 2006 2 14604 4
5 4 2005 3 101423 6
6 4 2006 1 102257 4
7 6 2005 0 39039 4
8 6 2006 1 39346 3
> merge(df1[-c(2,4)], df2, by=c('ID', 'ID2'), all.y=T)[-2]
ID freq1 year freq2 mun
1 1 2 2004 5 60857
2 1 1 2005 3 61137
3 2 3 2004 4 14278
4 2 2 2005 4 14520
5 3 NA 2004 2 22563
6 3 NA 2005 0 22635
7 4 3 2004 6 101015
8 4 1 2005 4 101423
9 5 NA 2004 6 61152
10 5 NA 2005 3 61932
11 6 0 2004 4 38456
12 6 1 2005 3 39039
Note that NA
values are used where there is no match. You can replace these with 0
values if that is really appropriate.
The [-2]
at the end removes the added column ID2
.
This is a fairly unusual way to merge. It depends on the order of the data in addition the values, so it does seem to be fragile. But I do think that I've captured what you want to accomplish.
Upvotes: 2
Reputation: 2970
Use match function to find corresponding rows between DF1 and DF2. See the code below.
# Find rows in DF1 that matches rows in DF2, get "freq2" values from them.
cbind(DF1, DF2[ match( DF1[,"year"], DF2[,"year"] ), "freq2" ])
# Find rows in DF1 that matches rows in DF2, get "freq2" values from them.
cbind(DF2, DF1[ match( DF2[,"year"], DF1[,"year"] ), "freq1" ])
Upvotes: 2