Kashif

Reputation: 3327

Trying to vectorize this operation in R, and I don't see why my attempt is wrong

I want to loop through a data frame and create a new column that says 'YES' if the 2nd to 4th elements in the row are 'ANOMALY' and 'NO' otherwise.

for (j in 1:nrow(residual_anomalies)) {
  if (all(residual_anomalies[j, 2:4] == 'ANOMALY')) {
    residual_anomalies$Prediction_Anomaly[j] <- 'YES'
  } else {
    residual_anomalies$Prediction_Anomaly[j] <- 'NO'
  }
}

The above is what I'm currently using. It works, but it is slow, so I'm trying to vectorize it. So far I have written a function that returns 'YES' or 'NO' depending on whether all the elements of the row are 'ANOMALY'.

vote_for_anomaly <- function(x){
  if (all(x)=='ANOMALY') return('YES') else
    return('NO')}

And then I try to use the apply function in R

 aggregates <- apply(residual_anomalies[,2:4],1,vote_for_anomaly)

but then I'm getting the following errors/warnings

Error in if (all(x) == "ANOMALY") return("ANOMALY") else return("NO SIGNAL") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In all(x) : coercing argument of type 'character' to logical

Can someone tell me why this isn't working and how I should change this?

You can use this data for testing and call it residual_anomalies

1     ANOMALY       ANOMALY       ANOMALY       ANOMALY
2     ANOMALY       NO SIGNAL     ANOMALY       ANOMALY
3     ANOMALY       ANOMALY       ANOMALY       ANOMALY
4     NO SIGNAL     ANOMALY       NO SIGNAL     ANOMALY
5     ANOMALY       ANOMALY       ANOMALY       ANOMALY
6     NO SIGNAL     NO SIGNAL     ANOMALY       ANOMALY
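
For convenience, the table above could be entered as a data frame roughly like this. The column names V1 to V4 are an assumption (they are the defaults R tends to assign when no header is given), and the leading number in each row is treated as a row label rather than data.

```r
# Test data transcribed from the table above.
# Column names V1..V4 are assumed, not given in the question.
residual_anomalies <- data.frame(
  V1 = c("ANOMALY", "ANOMALY", "ANOMALY", "NO SIGNAL", "ANOMALY", "NO SIGNAL"),
  V2 = c("ANOMALY", "NO SIGNAL", "ANOMALY", "ANOMALY", "ANOMALY", "NO SIGNAL"),
  V3 = c("ANOMALY", "ANOMALY", "ANOMALY", "NO SIGNAL", "ANOMALY", "ANOMALY"),
  V4 = rep("ANOMALY", 6),
  stringsAsFactors = FALSE  # keep the values as character, not factor
)
```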

Upvotes: 1

Views: 54

Answers (3)

Gavin Simpson

Reputation: 174778

It might be quicker to do this using indexing rather than ifelse(). First set up a vector of "No" of the required length:

aggregates <- rep("No", NROW(residual_anomalies))

Then assign "Yes" to the elements of this vector where every value in residual_anomalies[, 2:4] equals "ANOMALY":

aggregates[rowSums(residual_anomalies[, 2:4] == "ANOMALY") == 3L] <- "Yes"

This gives:

> aggregates
[1] "Yes" "No"  "Yes" "No"  "Yes" "No"

This part, residual_anomalies[, 2:4] == "ANOMALY", creates a logical matrix:

> residual_anomalies[, 2:4] == "ANOMALY"
        V2    V3   V4
[1,]  TRUE  TRUE TRUE
[2,] FALSE  TRUE TRUE
[3,]  TRUE  TRUE TRUE
[4,]  TRUE FALSE TRUE
[5,]  TRUE  TRUE TRUE
[6,] FALSE  TRUE TRUE

When we take the rowSums(), TRUE is converted to 1 and FALSE to 0, so a row sums to 3 only when all three of its elements are TRUE. Hence only those rows get selected and assigned "Yes".
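
As a small illustration of that counting step, here is a sketch that rebuilds the logical matrix printed above by hand and sums its rows. The names m, counts and all_true are only for this example, and comparing against ncol(m) instead of a literal 3 is one possible way to avoid hard-coding the number of columns.

```r
# Logical matrix copied from the output above (rows 1 to 6, columns V2..V4)
m <- rbind(c(TRUE,  TRUE,  TRUE),
           c(FALSE, TRUE,  TRUE),
           c(TRUE,  TRUE,  TRUE),
           c(TRUE,  FALSE, TRUE),
           c(TRUE,  TRUE,  TRUE),
           c(FALSE, TRUE,  TRUE))

counts <- rowSums(m)          # TRUE counts as 1, FALSE as 0: 3 2 3 2 3 2
all_true <- counts == ncol(m) # TRUE only for rows where every element is TRUE

aggregates <- rep("No", nrow(m))
aggregates[all_true] <- "Yes"
aggregates                    # "Yes" "No" "Yes" "No" "Yes" "No"
```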

Upvotes: 1

cr1msonB1ade

Reputation: 1716

As @lukeA said, you have mixed up your parentheses, but here is a simpler overall solution as well:

aggregates <- ifelse(apply(residual_anomalies, 1, 
     function(x) all(x[2:4] == "ANOMALY")), "YES", "NO")
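
To spell out the parenthesis mix-up: in all(x) == 'ANOMALY', all(x) is evaluated first. Because x is character, all() coerces it to logical (hence the warning), which gives NA, and NA == 'ANOMALY' is also NA, which if() cannot use. A minimal sketch of the question's function with the comparison moved inside all() follows; the two-row matrix flags is a made-up stand-in for residual_anomalies[, 2:4].

```r
# Question's function with the parentheses fixed: compare first, then all()
vote_for_anomaly <- function(x) {
  if (all(x == "ANOMALY")) "YES" else "NO"
}

# Hypothetical stand-in for residual_anomalies[, 2:4]
flags <- rbind(c("ANOMALY",   "ANOMALY", "ANOMALY"),
               c("NO SIGNAL", "ANOMALY", "ANOMALY"))

aggregates <- apply(flags, 1, vote_for_anomaly)
aggregates  # "YES" "NO"
```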

Upvotes: 0

Nick Kennedy

Reputation: 12640

Per @lukeA, there's a typo in your code. It should be

all(x == "ANOMALY")

but it would be faster to do:

residual_anomalies$Prediction_Anomaly <-
  ifelse(rowSums(residual_anomalies[, 2:4] == "ANOMALY") == 3, "YES", "NO")

rowSums is very fast.
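
If you want to check the speed difference on your own machine, a rough timing sketch along these lines may help. The synthetic data frame big, its size, and the variable names are all assumptions for illustration, and the timings will vary from system to system.

```r
set.seed(1)

# Synthetic data: 4 character columns, 20,000 rows (size chosen arbitrarily)
n <- 20000
big <- as.data.frame(
  matrix(sample(c("ANOMALY", "NO SIGNAL"), 4 * n, replace = TRUE), ncol = 4),
  stringsAsFactors = FALSE
)

# Row-by-row version using apply()
t_apply <- system.time(
  via_apply <- ifelse(apply(big[, 2:4], 1, function(x) all(x == "ANOMALY")),
                      "YES", "NO")
)

# Vectorized version using rowSums()
t_rowsums <- system.time(
  via_rowsums <- ifelse(rowSums(big[, 2:4] == "ANOMALY") == 3, "YES", "NO")
)

# Both approaches should give the same labels
stopifnot(identical(unname(via_apply), unname(via_rowsums)))

cat("apply:   ", t_apply[["elapsed"]], "seconds\n")
cat("rowSums: ", t_rowsums[["elapsed"]], "seconds\n")
```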

Upvotes: 0
