Loop over multiple columns in a merge in R

Question

I am trying to loop the merging of two dataframes over multiple columns, but I'm having trouble with the code and haven't been able to find any answers on SO. Here are some example data frames:

box <- c(5,7,2)
year <- c(1999,1999,1999)
rep5 <- c(5,5,5)
rep7 <- c(7,7,7)
rep2 <- c(2,2,2)
df1 <- data.frame(box,year,rep5,rep7,rep2)

box1 <- c(5,5,5,5,7,7,7,7,2,2,2,2)
box2 <- c(5,7,2,5,5,7,2,4,5,7,2,9)
year2 <- c(1999,1999,1999,2000,1999,1999,1999,1999,1999,1999,1999,1999)
distance <- c(0,100,200,0,100,0,300,200,200,300,0,300)
df2 <- data.frame(box1,box2,year2,distance)

df1
  box year rep5 rep7 rep2
1   5 1999    5    7    2
2   7 1999    5    7    2
3   2 1999    5    7    2

df2
   box1 box2 year2 distance
1     5    5  1999        0
2     5    7  1999      100
3     5    2  1999      200
4     5    5  2000        0
5     7    5  1999      100
6     7    7  1999        0
7     7    2  1999      300
8     7    4  1999      200
9     2    5  1999      200
10    2    7  1999      300
11    2    2  1999        0
12    2    9  1999      300

What I am trying to do is get the distance information from df2 into df1, with df1 year matched to df2 year, df1 box matched to df2 box1, and df1 rep[i] matched to df2 box2. I can do this for a single df1 rep[i] column as follows:

merge(df1, df2, by.x=c("box", "rep5", "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)

this gives the desired output:

  box rep5 year rep7 rep2 distance
1   2    5 1999    7    2      200
2   5    5 1999    7    2        0
3   7    5 1999    7    2      100

However, in order to save doing this for each rep[i] column individually (I have a lot of these columns in the real data set), I'd like to be able to loop over those columns. Here is the code I have tried to do that:

reps <- c(df1$rep7, df1$rep2)
df3 <- for (i in reps) {merge(df1, df2, by.x=c("box", i, "year"), by.y=c("box1", "box2", "year2"), all.x = TRUE)}
df3

When I run that code, I get the error "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column." I also tried defining

reps <- c("rep7", "rep2")

When I run the same code using that definition, I get the result that df3 is NULL.

The output that I want (with the distance column renamed for clarity) is:

  box year rep5 rep7 rep2 dist5 dist7 dist2
1   2 1999    5    7    2   200   300     0
2   5 1999    5    7    2     0   100   200
3   7 1999    5    7    2   100     0   300

What am I doing wrong? Any help you can give me would be very much appreciated!

Loop over multiple columns in a merge in R

Answers (1)

Related Questions