mats

Reputation: 133

How to assign NA's using IF statement?

I want to categorize a vector of values between 0 and 1. Values below .001 and values higher than .10 are of no interest, so I want values in these ranges to be NA.

When I run the code below I get an error:

Error in if (x[i] > 0.001 & x[i] <= 0.01) x[i] = 0.01 :  missing value where TRUE/FALSE needed

How do I fix my code?

for (i in 1:length(x))
  {
    if (x[i] <= .001)
      x[i] = NA
    if (x[i] > .001 & x[i] <= .01)
      x[i] = .01
    if (x[i] > .01 & x[i] <= .02)
      x[i] = .02
    if (x[i] > .02 & x[i] <= .03)
      x[i] = .03
    if (x[i] > .03 & x[i] <= .04)
      x[i] = .04
    if (x[i] > .04 & x[i] <= .05)
      x[i] = .05
    if (x[i] > .05 & x[i] <= .06)
      x[i] = .06
    if (x[i] > .06 & x[i] <= .07)
      x[i] = .07
    if (x[i] > .07 & x[i] <= .08)
      x[i] = .08
    if (x[i] > .08 & x[i] <= .09)
      x[i] = .09
    if (x[i] > .09 & x[i] <= .10)
      x[i] = .10
    if (x[i] > .10 & x[i] <= 1)
      x[i] = NA
  }

Upvotes: 4

Views: 30876

Answers (4)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

First, some test data:

set.seed(1); x = dnorm(rnorm(100))/(sample(1:100, 100, replace=TRUE))

Subsetting can be done in the following way:

x[x < .001] = NA
x[x > .1] = NA

Or, you can combine it in one statement:

x[x < .001 | x > .1] = NA

Update: To answer why your code is not working

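Your loop fails because the first if can set x[i] to NA, and every later comparison on that element (such as x[i] > .001) then evaluates to NA, which if cannot use as a condition. To keep the loop, take the NA assignments out of it: record the positions that should become NA before you run the loop, then assign NA to them afterwards.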

temp = which(x < .001 | x > .1) # Index the values you want to set as NA

Remove the following conditions from your for loop:

if (x[i] > .10 & x[i] <= 1)
  x[i] = NA
if (x[i] <= .001)
  x[i] = NA

Run your for loop, and then use temp to set the values to NA that should be NA.

x[temp] = NA
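As a minimal sketch of the failure and of one way to guard against it inside the loop: the sample values below are assumed, and ceiling() stands in for the chain of range checks from the question. The is.na() test comes first so that the later comparisons are never evaluated on a missing value.

```r
# Assumed sample values, not taken from the question
x <- c(0.0005, 0.004, 0.015, 0.095, 0.5)

# A comparison with NA yields NA, and if() rejects NA as a condition
msg <- tryCatch(
  if (NA > 0.001) "yes" else "no",
  error = function(e) conditionMessage(e)
)

y <- x  # work on a copy so the input is preserved
for (i in seq_along(y)) {
  # || short-circuits, so the comparisons are skipped when y[i] is NA
  if (is.na(y[i]) || y[i] <= 0.001 || y[i] > 0.10) {
    y[i] <- NA
  } else {
    # round up to the top of the 0.01-wide bin
    y[i] <- ceiling(y[i] * 100) / 100
  }
}
```

Because each element goes through exactly one branch, nothing is compared after it has been set to NA.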

Hope this helps!

Update 2: Two lines

x[x < .001 | x > .1] = NA
out <- ceiling(x*100)/100

Pretty much the same as AKE's suggestion using floor.

This should get you the same results as your loop.

Upvotes: 7

IRTFM

Reputation: 263481

The findInterval function can be used productively in this very structured choice problem. It produces an index that can "lookup" or select the desired result for values in particular intervals:

x <- rnorm(1000)
x <- c(NA, seq(0.1, 1, by=0.1), NA)[
            1+ findInterval(x, c(0.001, seq(0.1, 1, by=0.1)) ,rightmost.closed=TRUE) ]
#---------------
table(x)
x
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 
 34  38  48  44  29  30  26  20  17  31 
> table(is.na(x))

FALSE  TRUE 
  317   683

The rightmost.closed argument closes the last interval on the right as well (findInterval's intervals are otherwise closed only on the left), although in this example it didn't matter, since none of the random draws fell on a boundary. It's generally not a good idea to destroy your input data, though; I hope x was a copy of your original data. The other way of doing this would be to omit the 1+ and instead use intervals in the findInterval second argument like c(-Inf, 0.001, seq(0.1, 1, by=0.1), Inf).
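The same lookup idea can be adapted to the 0.01-wide bins from the question. The following is only a sketch with assumed sample values; it relies on the left.open argument of findInterval (available in R 3.3.0 and later) so that intervals are open on the left and closed on the right, matching the question's "> lower & <= upper" tests. The result goes into a new vector, out, so the input is left untouched.

```r
# Assumed sample values, not taken from the question
x <- c(0.0005, 0.004, 0.015, 0.095, 0.5)

breaks <- c(0.001, seq(0.01, 0.10, by = 0.01))   # 11 break points
lookup <- c(NA, seq(0.01, 0.10, by = 0.01), NA)  # 12 possible results

# findInterval returns 0 for values at or below the first break and
# 11 for values above the last, so 1 + index selects NA at both ends
idx <- findInterval(x, breaks, left.open = TRUE)
out <- lookup[1 + idx]
```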

Upvotes: 0

Jason Morgan

Reputation: 2330

Instead of using an explicit for loop, you should try to use a vectorized function, such as the very handy ifelse. Here is how to recode the NAs in your example:

> x <- ifelse(x <= 0.001 | x > 0.1, NA, x)

To recode the other values, you could try some "clever" use of cut:

> x <- cut(x, breaks=seq(0, 0.1, 0.01), labels=FALSE) / 100

though there are likely better (and more transparent) ways. The reason for avoiding explicit for loops in R is that they are very inefficient when compared to vectorized alternatives. The R Inferno provides a good discussion of this and other R tricks and tips.
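One possibly more transparent variant, sketched here with assumed sample values, lets cut do both jobs at once: anything outside the outermost breaks comes back as NA, and the integer bin code divided by 100 is the upper edge of the bin.

```r
# Assumed sample values, not taken from the question
x <- c(0.0005, 0.004, 0.015, 0.095, 0.5)

# Bins are (0.001, 0.01], (0.01, 0.02], ..., (0.09, 0.10];
# values outside (0.001, 0.10] get NA from cut itself
breaks <- c(0.001, seq(0.01, 0.10, by = 0.01))
code <- cut(x, breaks = breaks, labels = FALSE)  # integer codes 1..10 or NA
out <- code / 100
```

cut closes intervals on the right by default (right = TRUE), which is what makes the bin edges line up with the comparisons in the original loop.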

Upvotes: 1

Assad Ebrahim

Reputation: 6361

While your solution works conceptually, it is "brute force", which means a lot of typing, won't scale to a slightly different problem, and is also slow to execute.

R allows working with vectors so if your logic works for an arbitrary number between 0 and 1, then it should work with a vector of values between 0 and 1.

Try something like the following:

      y <- floor(100 * x)                      # all values < 0.01 map to 0
      y[y >= 10] <- 0                          # force values >= 0.1 to 0
      y <- ifelse(y > 0, (y + 1) / 100, y)     # for non-zero values, map to the upper interval, then return to original scale

The first line squashes all values less than 0.01 to 0. The second line squashes all values greater than 0.1 to 0. The third line lifts the remaining non zero values to the top value of the range (round up) and returns them to the original scale.

Upvotes: 0
