Jason

Reputation: 35

Compare two large numeric vectors with multiple conditions without looping through

I have two large vectors of about 100K elements each, holding integer data (e.g. 0, 1, 2, 3, ..., 70). I want to compare these two vectors element by element against multiple conditions and put a value in a third vector based on which condition holds. If I loop through them with a for loop and multiple if statements, it takes about 5 hours to run on a powerful cluster. Is there a way to speed it up or get the same result without looping?

Thanks.

Example:

A <- c(3,0,1,0,6,1,10,5,1,8,1,4) # 12 elements each
B <- c(1,0,5,1,0,2,2,4,0,1,2,10)

Conditions:

if (A[i] == 1 && B[i] == 1)
{
  C[i] <- "Alpha"
}
if (A[i] >= 1 || B[i] >= 1)
{
  if (A[i] > 1 || B[i] > 1)
  {
    C[i] <- "Bravo"
  }
}
if (A[i] == 0 || B[i] == 0)
{
  if (A[i] >= 1 || B[i] >= 1)
  {
    C[i] <- "Charlie"
  }
}
if (A[i] == 0 && B[i] == 0)
{
  C[i] <- "Delta"
}

Upvotes: 0

Views: 522

Answers (2)

Miff

Reputation: 7941

R is most efficient when you work on whole vectors at once and let the underlying Fortran/C code take care of the optimisation. So you could try something like:

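  # Start with "Alpha" everywhere; this default is only right if A and B hold
  # non-negative integers, so that every non-Alpha case is overwritten below.
  C <- rep("Alpha", length(A))
  C[(A >= 1 | B >= 1) & (A > 1 | B > 1)] <- "Bravo"
  # Later assignments overwrite earlier ones, as in the original loop
  C[(A == 0 | B == 0) & (A >= 1 | B >= 1)] <- "Charlie"
  C[A == 0 & B == 0] <- "Delta"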

Note that | and & are the vectorised versions of || and &&: they compare elementwise (help is at ?'|').
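If you would rather not rely on "Alpha" as the default, here is a variant sketch of the same logical-index approach. The function name classify_pairs is hypothetical (it is not from the question), and it assumes you want any pair that matches none of the four rules to stay NA instead of silently becoming "Alpha":

```r
# Hypothetical helper: classify pairs with logical-index assignment,
# leaving positions that match no rule as NA.
classify_pairs <- function(A, B) {
  stopifnot(length(A) == length(B))
  C <- rep(NA_character_, length(A))
  # Assignment order mirrors the if blocks in the question:
  # a later rule overwrites an earlier one where both apply.
  C[A == 1 & B == 1] <- "Alpha"
  C[(A >= 1 | B >= 1) & (A > 1 | B > 1)] <- "Bravo"
  C[(A == 0 | B == 0) & (A >= 1 | B >= 1)] <- "Charlie"
  C[A == 0 & B == 0] <- "Delta"
  C
}

A <- c(3, 0, 1, 0, 6, 1, 10, 5, 1, 8, 1, 4)
B <- c(1, 0, 5, 1, 0, 2, 2, 4, 0, 1, 2, 10)
result <- classify_pairs(A, B)
result
```

With this version an unexpected value (a negative number, say) shows up as NA in the result, which makes bad input easier to spot.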

Upvotes: 2

hrbrmstr

Reputation: 78792

I ran your for loop version and the results match the following:

A <- c(3,0,1,0,6,1,10,5,1,8,1,4) # 12 elements each
B <- c(1,0,5,1,0,2,2,4,0,1,2,10)

C <- ifelse((A==1 & B==1), "Alpha", 
            ifelse((A==0 | B==0) & (A>=1 | B>=1), "Charlie",
                   ifelse((A>=1 | B>=1) & (A>1 | B>1), "Bravo",               
                          ifelse(A==0 & B==0, "Delta", NA))))

C

##  [1] "Bravo"   "Delta"   "Bravo"   "Charlie" "Charlie" "Bravo"   "Bravo"   "Bravo"   "Charlie" "Bravo"  
## [11] "Bravo"   "Bravo"
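To check the match yourself, something along these lines should work. It is only a sketch: loop_version and ifelse_version are hypothetical wrapper names, the loop is my reading of the question's conditions (nested ifs flattened into a single condition each), and C is initialised to NA so both versions agree on unmatched elements:

```r
# Hypothetical wrapper around the question's conditions, run in a for loop
loop_version <- function(A, B) {
  C <- rep(NA_character_, length(A))
  for (i in seq_along(A)) {
    if (A[i] == 1 && B[i] == 1) C[i] <- "Alpha"
    if ((A[i] >= 1 || B[i] >= 1) && (A[i] > 1 || B[i] > 1)) C[i] <- "Bravo"
    if ((A[i] == 0 || B[i] == 0) && (A[i] >= 1 || B[i] >= 1)) C[i] <- "Charlie"
    if (A[i] == 0 && B[i] == 0) C[i] <- "Delta"
  }
  C
}

# Hypothetical wrapper around the nested ifelse() call from this answer
ifelse_version <- function(A, B) {
  ifelse(A == 1 & B == 1, "Alpha",
         ifelse((A == 0 | B == 0) & (A >= 1 | B >= 1), "Charlie",
                ifelse((A >= 1 | B >= 1) & (A > 1 | B > 1), "Bravo",
                       ifelse(A == 0 & B == 0, "Delta", NA))))
}

set.seed(1492)
A <- sample(0:10, 1000, replace = TRUE)
B <- sample(0:10, 1000, replace = TRUE)

# Compare the two results element by element
same <- identical(loop_version(A, B), ifelse_version(A, B))
same
```

Note that "Charlie" has to be tested before "Bravo" in the ifelse() chain, because in the loop the "Charlie" block runs later and overwrites "Bravo" where both conditions hold.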

There's definitely a speed improvement, too:

set.seed(1492)

A <- sample(0:10, 100000, replace=TRUE)
B <- sample(0:10, 100000, replace=TRUE)

system.time(C <- ifelse((A==1 & B==1), "Alpha", 
            ifelse((A==0 | B==0) & (A>=1 | B>=1), "Charlie",
                   ifelse((A>=1 | B>=1) & (A>1 | B>1), "Bravo",               
                          ifelse(A==0 & B==0, "Delta", NA)))))

##  user  system elapsed 
## 0.350   0.004   0.354 

The reason for the single & and | operators is straight from the R help:

& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.
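A small sketch of the difference, using made-up vectors x and y (not from the question). The short forms return one result per position, while the long forms work on single values and stop evaluating as soon as the answer is known. As far as I know, recent R versions (4.3.0 and later) raise an error if you give && or || a vector longer than one element, so the "examines only the first element" wording above describes older behaviour:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Elementwise forms: one logical result per position
elementwise_and <- x & y
elementwise_or  <- x | y

# Scalar forms short-circuit: the right-hand side is never evaluated
# when the left-hand side already decides the result, so stop() is not reached.
short_and <- FALSE && stop("never evaluated")
short_or  <- TRUE  || stop("never evaluated")

elementwise_and
elementwise_or
```

This is why the vectorised answers use & and |, while the original loop, which tests one pair A[i], B[i] at a time, can use && and ||.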

Upvotes: 2
