Avoid FOR loop in R programming

Question

I have 2 dataframes below,

col1_x <- c(0123,123,234,4567,77789,4578,45588,669887,7887,5547)
col2_x <- c('X1','X8','X2','X55','C12','B11','Z1','SS12','D9','F55')
a    <- c(10,9,8,7,6,5,4,3,2,1)
DF1 <- cbind(col1_x,col2_x,a)
DF1 <- as.data.frame(DF1, stringsAsFactors = F)

col1_y <- c(012,123,56,55,78,5547)
col2_y <- c('X1','X8','S2','ER4','KL1','F55')
b    <- c(111,222,NA,NA,555,666)
DF2 <- cbind(col1_y,col2_y,b)
DF2 <- as.data.frame(DF2, stringsAsFactors = F)

Below is the code I have written for this.

# code1
for (i in 1:nrow(DF2)) { 
  if(is.na(DF2$b[i])) {} else {
    DF1 <-mutate(DF1, 
                 a = ifelse(col1_x == DF2$col1_y[i] & col2_x == DF2$col2_y[i],
                            DF2$b[i],a) )
  }
}

# code2
if(is.na(DF2$b)) {} else {
  DF1$a <- ifelse(DF1$col1_x == DF2$col1_y & DF1$col2_x == DF2$col2_y, DF2$b, DF1$a)
}

I am getting warnings as below when I run code2,

Warning messages:
1: In if (is.na(Y$b)) { :
  the condition has length > 1 and only the first element will be used
2: In X$col1 == Y$col1 :
  longer object length is not a multiple of shorter object length
3: In X$col2 == Y$col2 :
  longer object length is not a multiple of shorter object length

Kindly help me fix this without using a FOR loop, as the loop takes a lot of time to iterate.

Note: code1 satisfies my requirement

r2evans · Accepted Answer

This accomplishes what your code1 does, without the warnings.

library(dplyr)
left_join(DF1, DF2, by = c("col1_x" = "col1_y", "col2_x" = "col2_y")) %>%
  mutate(a = coalesce(b, a)) %>%
  select(-b)
#    col1_x col2_x   a
# 1     123     X1  10
# 2     123     X8 222
# 3     234     X2   8
# 4    4567    X55   7
# 5   77789    C12   6
# 6    4578    B11   5
# 7   45588     Z1   4
# 8  669887   SS12   3
# 9    7887     D9   2
# 10   5547    F55 666

If I have correctly interpreted the results you need, then this is far faster, more efficient, and safer than any implementation with for loops and base::ifelse (which can be problematic on its own).
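If dplyr is not available, the same idea can be sketched in base R with match() on a combined key. This is only an illustration under some assumptions: the data is a trimmed-down copy of the question's frames (all columns stored as character, which is what cbind() produces), the names key1, key2, idx, and b_matched are mine, and the "|" separator is assumed not to occur in either key column. Unlike left_join, match() only uses the first matching row of DF2 when a key is duplicated.

```r
# Trimmed-down copies of the question's data; columns are character
DF1 <- data.frame(col1_x = c("123", "123", "234", "5547"),
                  col2_x = c("X1", "X8", "X2", "F55"),
                  a      = c("10", "9", "8", "1"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(col1_y = c("12", "123", "56", "5547"),
                  col2_y = c("X1", "X8", "S2", "F55"),
                  b      = c("111", "222", NA, "666"),
                  stringsAsFactors = FALSE)

# Build one key per row from both columns (assumes "|" never appears in them)
key1 <- paste(DF1$col1_x, DF1$col2_x, sep = "|")
key2 <- paste(DF2$col1_y, DF2$col2_y, sep = "|")

# For each DF1 row: index of the first matching DF2 row, or NA if none
idx <- match(key1, key2)

# b from the matching row; NA where there is no match or b itself is NA
b_matched <- DF2$b[idx]

# Same effect as coalesce(b, a): take b when present, otherwise keep a
DF1$a <- ifelse(is.na(b_matched), DF1$a, b_matched)
DF1$a   # "10" "222" "8" "666"
```

No loop is involved: match() and ifelse() each work on the whole column at once.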

To learn more about merges and joins like this, see How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272. Really, part of data-science-y tasks is knowing how to deal with data consistently, safely, quickly, efficiently, and ... safely. Yes, I said it twice. If there is anything in your code that might, just might, confuse one observation with another, all of your results and inferences are at best questionable, if not completely corrupted. (I'll get off my soapbox now.)

As for your warnings:

condition has length > 1 ....

if statements require a length-1 conditional, period. Not length 0, not length 2 or more. Length 1. Since your Y frame (actually DF2 now) has more than 1 row, this is broken.

Think of it this way: if (true) then do task 1 makes sense. if (true, false, false, true true true true, false) do task 1 does not make sense. What should happen?

One of two things is needed here:
- You need if, so you should be looking at one of:
  - any(is.na(Y$b));
  - all(is.na(Y$b)); or
  - a specific one of them, such as is.na(Y$b[17]) (if there were at least 17 of them)
- You need ifelse, which would work on a vector of logicals. (I don't think it's this one.)
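For illustration, here is a small base R sketch of those options, using the b column from the question as a plain numeric vector. The names any_missing, all_missing, third_missing, and filled, and the shortened a vector, are mine and only there for the example. Note that since R 4.2.0, a condition of length greater than 1 in if is an error rather than a warning.

```r
b <- c(111, 222, NA, NA, 555, 666)
a <- c(10, 9, 8, 7, 6, 5)     # hypothetical values, same length as b

is.na(b)                      # one logical per element: not usable in if() as-is

any_missing   <- any(is.na(b))  # TRUE: at least one element is NA
all_missing   <- all(is.na(b))  # FALSE: not every element is NA
third_missing <- is.na(b[3])    # TRUE: one specific element

# Each of the three values above is length 1, so it is safe inside if()
if (any_missing) {
  message("at least one value of b is missing")
}

# ifelse() is vectorized: where b is NA keep a, otherwise take b
filled <- ifelse(is.na(b), a, b)
filled                        # 111 222 8 7 555 666
```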

longer object length is not a multiple of shorter object length

This seems clear, but you don't understand why it's happening.

Consider these questions:
- c(1,2) == c(1,2) is really asking c(1==1, 2==2), right? Good.
- c(1,2) == 1 is really asking c(1==1, 2==1). Good.
(Neither of those would go in an if statement, btw :-)
- c(1,2) == c(1,2,3,4) is confusingly not an error in R due to argument-recycling. I really think it should be an error, because many of the times it is used/relied-on, it is a mistake, and the results are corrupted/incorrect. However, this is really producing c(1==1, 2==2, 1==3, 2==4). Yup, recycling. And while not a warning/error, this might be useful but is often a silent mistake. This only works though when the length of one vector is a perfect multiple of the length of the other vector.
- c(1,2,9) == c(1,2,3,4,5) will try to recycle as c(1==1, 2==2, 9==3, 1==4, 2==5) (and will give results for that), but ... doesn't that seem just a bit odd to you? Well, it might be okay to you, and while there might be legitimate uses of this type of recycling, more often than not (in my experience) it is a mistake in code. If you really mean this and you really know that this type of arbitrary comparison is what you want, then wrap it in suppressWarnings and don't come to me when your data results are seemingly inconsistent with the inputs.

More often than not, when questions pop up with this warning, people should be thinking "set operations" instead of ==, and what they need is %in%. Now, think of this:
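The four cases above can be run directly. This is just a sketch to make the recycling visible; the names same_len, scalar, silent, odd, and msg are mine. The last case captures the warning with withCallingHandlers purely so the example can show both the result and the fact that a warning was raised.

```r
same_len <- c(1, 2) == c(1, 2)        # TRUE TRUE
scalar   <- c(1, 2) == 1              # TRUE FALSE (the 1 is recycled)
silent   <- c(1, 2) == c(1, 2, 3, 4)  # TRUE TRUE FALSE FALSE, no warning

# Lengths 3 and 5: R still recycles, but warns about it
msg <- NULL
odd <- withCallingHandlers(
  c(1, 2, 9) == c(1, 2, 3, 4, 5),
  warning = function(w) {
    msg <<- conditionMessage(w)       # keep the warning text for inspection
    invokeRestart("muffleWarning")    # continue and return the result
  }
)
odd   # TRUE TRUE FALSE FALSE FALSE
msg   # the "longer object length ..." warning text
```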
- c(1,2,9) %in% c(1,2,3,4,5) yields c(TRUE, TRUE, FALSE). (Length 3, not length 5.) You're asking c("is 1 in 1:5?", "is 2 in 1:5?", "is 9 in 1:5?").
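
As a quick check of that behavior, here is a sketch that runs the example and then applies %in% to the question's col2 vectors. The names x, y, in_y, and col2_hit are mine.

```r
x <- c(1, 2, 9)
y <- c(1, 2, 3, 4, 5)

in_y <- x %in% y
in_y            # TRUE TRUE FALSE
length(in_y)    # 3: always the length of the left-hand side

# The same operator on the question's second key column
col2_x <- c('X1','X8','X2','X55','C12','B11','Z1','SS12','D9','F55')
col2_y <- c('X1','X8','S2','ER4','KL1','F55')

col2_hit <- col2_x %in% col2_y   # length 10, no recycling, no warning
which(col2_hit)                  # 1 2 10
```

Keep in mind that testing each column separately with %in% does not guarantee that both columns match on the same row of DF2, which is why a join is the safer tool for the original two-column task.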
