Reputation: 726
I have two data frames of different length, and I want to add a new column to the first data frame with corresponding values of the second data frame.
The corresponding value is defined by the following condition if (DF1[i,1] == DF2[,1] & DF1[i,2] == DF2[i,2]) == TRUE
, then the value of this row should be taken from DF2 and written to DF1$newColumn[i]
.
The following data frames are used to illustrate the question:
DF1<-data.frame(X = rep(c("A","B","C"),each=3),
Y = rep(c("a","b","c"),each=3))
DF2<-data.frame(X = c("A","B","C"),
Y = c("a","b","c"),
Z = c(1:3))
I tried to use if()
statements as in the text above but the condition returns a vector of TRUE/FALSE and that doesn't seem to work.
The code that works that I use now is
for (i in 1 : length(DF1[,1])) {
DF1$Z[i] <- subset(DF2,DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i])$Z
}
However it is incredibly slow (user system elapsed 115.498 12.341 127.799 for my full dataframe) and there must be a more efficient way to code this. Also, I have read repeatedly that vectorizing is more efficient then loops but I don't know how to do that.
I do need to work with conditional statements though so something like
DF1$Zz<-rep(DF2$Z,each=3)
wouldn't work for my real dataset.
Upvotes: 2
Views: 2065
Reputation: 891
DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X==DF1$X[i] & DF2$Y==DF1$Y[i]])
seems to be taking roughly a quarter of the time of your for loop.
I created DF1 with 300 each reps, and my function took ~2secs to run; your loop with subset took ~8secs to run, and repackaging your loop into an sapply it took ~5secs to run.
Upvotes: 2