Latrunculia
Latrunculia

Reputation: 726

How to add a column to a dataframe with values of another based on multiple conditions

I have two data frames of different length, and I want to add a new column to the first data frame with corresponding values of the second data frame.

The corresponding value is defined by the following condition if (DF1[i,1] == DF2[,1] & DF1[i,2] == DF2[i,2]) == TRUE, then the value of this row should be taken from DF2 and written to DF1$newColumn[i].

The following data frames are used to illustrate the question:

DF1<-data.frame(X = rep(c("A","B","C"),each=3),
                Y = rep(c("a","b","c"),each=3))

DF2<-data.frame(X = c("A","B","C"),
                Y = c("a","b","c"),
                Z = c(1:3))

I tried to use if() statements as in the text above but the condition returns a vector of TRUE/FALSE and that doesn't seem to work.

The code that works that I use now is

  for (i in 1 : length(DF1[,1])) {
  DF1$Z[i] <- subset(DF2,DF2$X == DF1$X[i] & DF2$Y == DF1$Y[i])$Z               
}

However it is incredibly slow (user system elapsed 115.498 12.341 127.799 for my full dataframe) and there must be a more efficient way to code this. Also, I have read repeatedly that vectorizing is more efficient then loops but I don't know how to do that.

I do need to work with conditional statements though so something like

DF1$Zz<-rep(DF2$Z,each=3)

wouldn't work for my real dataset.

Upvotes: 2

Views: 2065

Answers (1)

ds440
ds440

Reputation: 891

DF1$Z <- sapply(1:nrow(DF1), function(i) DF2$Z[DF2$X==DF1$X[i] & DF2$Y==DF1$Y[i]]) seems to be taking roughly a quarter of the time of your for loop.

I created DF1 with 300 each reps, and my function took ~2secs to run; your loop with subset took ~8secs to run, and repackaging your loop into an sapply it took ~5secs to run.

Upvotes: 2

Related Questions