Reputation: 99
I have the two data frames below:
col1_x <- c(0123,123,234,4567,77789,4578,45588,669887,7887,5547)
col2_x <- c('X1','X8','X2','X55','C12','B11','Z1','SS12','D9','F55')
a <- c(10,9,8,7,6,5,4,3,2,1)
DF1 <- cbind(col1_x,col2_x,a)
DF1 <- as.data.frame(DF1, stringsAsFactors = F)
col1_y <- c(012,123,56,55,78,5547)
col2_y <- c('X1','X8','S2','ER4','KL1','F55')
b <- c(111,222,NA,NA,555,666)
DF2 <- cbind(col1_y,col2_y,b)
DF2 <- as.data.frame(DF2, stringsAsFactors = F)
Below is the code I have written.
# code1
library(dplyr)
for (i in 1:nrow(DF2)) {
  if (is.na(DF2$b[i])) {} else {
    DF1 <- mutate(DF1,
                  a = ifelse(col1_x == DF2$col1_y[i] & col2_x == DF2$col2_y[i],
                             DF2$b[i], a))
  }
}
# code2
if(is.na(DF2$b)) {} else {
  DF1$a <- ifelse(DF1$col1_x == DF2$col1_y & DF1$col2_x == DF2$col2_y, DF2$b, DF1$a)
}
I get the following warnings when I run code2:
Warning messages:
1: In if (is.na(Y$b)) { :
the condition has length > 1 and only the first element will be used
2: In X$col1 == Y$col1 :
longer object length is not a multiple of shorter object length
3: In X$col2 == Y$col2 :
longer object length is not a multiple of shorter object length
How can I fix this without using a for loop? The loop takes a long time to iterate.
Note: code1 produces the result I need.
Upvotes: 1
Views: 51
Reputation: 160952
This accomplishes what your code1 does, without the warnings.
library(dplyr)
left_join(DF1, DF2, by = c("col1_x" = "col1_y", "col2_x" = "col2_y")) %>%
  mutate(a = coalesce(b, a)) %>%
  select(-b)
# col1_x col2_x a
# 1 123 X1 10
# 2 123 X8 222
# 3 234 X2 8
# 4 4567 X55 7
# 5 77789 C12 6
# 6 4578 B11 5
# 7 45588 Z1 4
# 8 669887 SS12 3
# 9 7887 D9 2
# 10 5547 F55 666
If I have interpreted correctly the results that you need, then this is far faster, more efficient, and safer than any implementation with for loops and base::ifelse (which can be problematic on its own).
To learn more about merges and joins like this, see "How to join (merge) data frames (inner, outer, left, right)" and https://stackoverflow.com/a/6188334/3358272. Really, part of data-science-y tasks is knowing how to deal with data consistently, safely, quickly, efficiently, and ... safely. Yes, I said it twice. If there is anything in your code that might, just might, confuse one observation with another, all of your results and inferences are at best questionable, if not completely corrupted. (I'll get off my </soapbox> now.)
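For comparison, the same update can be sketched in base R without dplyr. This is a different technique from the join above (a keyed lookup with match), offered only as an illustration. It assumes each (col1_y, col2_y) pair occurs at most once in DF2 and that the "|" separator never appears in the data. The frames are rebuilt with data.frame() so the block runs on its own and the numeric columns stay numeric (cbind on mixed vectors coerces everything to character). The names key1, key2, idx, new_a and hit are made up for this sketch.

```r
# Typed copies of the question's data.
DF1 <- data.frame(
  col1_x = c(123, 123, 234, 4567, 77789, 4578, 45588, 669887, 7887, 5547),
  col2_x = c("X1", "X8", "X2", "X55", "C12", "B11", "Z1", "SS12", "D9", "F55"),
  a      = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  col1_y = c(12, 123, 56, 55, 78, 5547),
  col2_y = c("X1", "X8", "S2", "ER4", "KL1", "F55"),
  b      = c(111, 222, NA, NA, 555, 666),
  stringsAsFactors = FALSE
)

# One composite key per row (assumes "|" does not occur in the data).
key1 <- paste(DF1$col1_x, DF1$col2_x, sep = "|")
key2 <- paste(DF2$col1_y, DF2$col2_y, sep = "|")

# For each DF1 row, the position of the first matching DF2 row (NA if none).
idx <- match(key1, key2)

# Candidate replacements: NA where there is no match or where b is missing.
new_a <- DF2$b[idx]

# Overwrite a only where a non-missing replacement exists.
hit <- !is.na(new_a)
DF1$a[hit] <- new_a[hit]

DF1$a
# [1]  10 222   8   7   6   5   4   3   2 666
```

Only rows 2 and 10 change, which agrees with the join output shown earlier. If DF2 could contain duplicate key pairs, match would take the first one, whereas the loop in code1 would end up with the last, so the join is the safer default.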
As for your warnings:
condition has length > 1 ...
if statements require a length-1 conditional, period. Not length 0, not length 2 or more. Length 1. Since your Y frame (actually DF2 now) has more than 1 row, this is broken.
Think of it this way: "if (true) then do task 1" makes sense. "if (true, false, false, true, true, true, true, false) do task 1" does not make sense. What should happen?
One of two things is needed here:
1. You need if, so you should be looking at one of: any(is.na(Y$b)); all(is.na(Y$b)); or is.na(Y$b[17]) (if there were at least 17 of them).
2. You need ifelse, which would work on a vector of logicals. (I don't think it's this one.)
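The two options above can be sketched on the b column from the question. This is a minimal illustration, not a fix for code2. The variable names are invented for the example, and replacing NA with 0 is only there to show how ifelse behaves. Note that since R 4.2.0 a condition of length greater than 1 in if is an error rather than a warning.

```r
# The b column from the question's DF2.
b <- c(111, 222, NA, NA, 555, 666)

# is.na() is vectorised: it returns one logical per element.
is.na(b)
# [1] FALSE FALSE  TRUE  TRUE FALSE FALSE

# Option 1: reduce to a single TRUE/FALSE before handing it to if.
any_missing   <- any(is.na(b))   # is at least one value missing?
all_missing   <- all(is.na(b))   # are all values missing?
third_missing <- is.na(b[3])     # is one specific element missing?

if (any_missing) {
  message("b contains at least one NA")
}

# Option 2: ifelse() works element by element and returns a vector
# of the same length as its test.
b_filled <- ifelse(is.na(b), 0, b)
b_filled
# [1] 111 222   0   0 555 666
```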
longer object length is not a multiple of shorter object length
This seems clear, but you don't understand why it's happening.
Consider these questions:
1. c(1,2) == c(1,2) is really asking c(1==1, 2==2), right? Good.
2. c(1,2) == 1 is really asking c(1==1, 2==1). Good.
(Neither of those would go in an if statement, btw :-)
3. c(1,2) == c(1,2,3,4) is confusingly not an error in R due to argument-recycling. I really think it should be an error, because many of the times it is used/relied-on, it is a mistake, and the results are corrupted/incorrect. However, this is really producing c(1==1, 2==2, 1==3, 2==4). Yup, recycling. And while it raises no warning/error, and might be useful, it is often a silent mistake. It only passes silently, though, when the length of one vector is an exact multiple of the length of the other.
4. c(1,2,9) == c(1,2,3,4,5) will try to recycle as c(1==1, 2==2, 9==3, 1==4, 2==5) (and will give results for that), but ... doesn't that seem just a bit odd to you? Well, it might be okay to you, and while there might be legitimate uses of this type of recycling, more often than not (in my experience) it is a mistake in code. If you really mean this and you really know that this type of arbitrary comparison is what you want, then wrap it in suppressWarnings and don't come to me when your data results are seemingly inconsistent with the inputs.
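The recycling cases above can be checked directly. This is a small sketch using the same vectors. The names even, uneven and warn_msg are made up for the example, and the warning text is assumed to be the English message quoted in the question.

```r
# Lengths 2 and 4: 4 is a multiple of 2, so R recycles without any warning.
even <- c(1, 2) == c(1, 2, 3, 4)
even
# [1]  TRUE  TRUE FALSE FALSE

# Lengths 3 and 5: recycling still happens, but R warns about it.
# suppressWarnings() is used here only to show the values that come back.
uneven <- suppressWarnings(c(1, 2, 9) == c(1, 2, 3, 4, 5))
uneven
# [1]  TRUE  TRUE FALSE FALSE FALSE

# Capture the warning text instead of printing it.
warn_msg <- tryCatch(
  c(1, 2, 9) == c(1, 2, 3, 4, 5),
  warning = function(w) conditionMessage(w)
)
warn_msg
```

In code2 the two sides have lengths 10 and 6, so every == comparison there falls into the second case.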
More often than not, when questions pop up with this, instead of ==, people should be thinking "set operations", where they need %in%. Now, think of this:
c(1,2,9) %in% c(1,2,3,4,5) yields c(TRUE, TRUE, FALSE). (Length 3, not length 5.) You're asking c("is 1 in 1:5?", "is 2 in 1:5?", "is 9 in 1:5?").
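A short sketch of the %in% behaviour, first on the toy vectors and then on the first key column from the question (retyped here as numeric so the block runs on its own). The variable names are invented for the example.

```r
x <- c(1, 2, 9)
y <- c(1, 2, 3, 4, 5)

# One answer per element of x, regardless of how long y is.
in_y <- x %in% y
in_y
# [1]  TRUE  TRUE FALSE

# The same idea on the question's first key column.
col1_x <- c(123, 123, 234, 4567, 77789, 4578, 45588, 669887, 7887, 5547)
col1_y <- c(12, 123, 56, 55, 78, 5547)

col1_in_y <- col1_x %in% col1_y
which(col1_in_y)
# [1]  1  2 10
```

Row 1 is flagged here even though its (col1_x, col2_x) pair has no partner in DF2, because %in% on a single column knows nothing about the second key. Matching on both columns at once is what the join handles.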
Upvotes: 3