Reputation: 45
I am trying to step through a vector to find the outliers, using the IQR to calculate a range. When I run this script looking for values to the right of the IQR I get results, and when I run it to the left I get the error: missing value where TRUE/FALSE needed. How can I scrub out the true and false in my dataset? Here is my script:
data = c(100, 120, 121, 123, 125, 124, 123, 123, 123, 124, 125, 167, 180, 123, 156)
Q3 <- quantile(data, 0.75) ## gets the third quartile of the vector
Q1 <- quantile(data, 0.25) ## gets the first quartile of the vector
outliers_left <-(Q1-1.5*IQR(data))
outliers_right <-(Q3+1.5*IQR(data))
IQR <- IQR(data)
paste("the interquartile range is", IQR)
Q1 # quantile at 0.25
Q3 # quantile at 0.75
# show the range of numbers we have
paste("your range is", outliers_left, "through", outliers_right, "to determine outliers")
# count how many values there are and then we will pass this value into a loop to look for
# anything above and below the Q1-Q3 values
vectorCount <- sum(!is.na(data))
i <- 1
while( i <= vectorCount ){
i <- i + 1
x <- data[i]
# if(x < outliers_left) {print(x)} # uncomment this to run and test for the left
if(x > outliers_right) {print(x)}
}
and the error I get is
[1] 167
[1] 180
[1] 156
Error in if (x > outliers_right) { :
missing value where TRUE/FALSE needed
As you can see if you run this script, it finds my 3 outliers on the right and also throws the error. But when I run it again on the left of my IQR, where I do have an outlier of 100 in the vector, I just get the error without any other results being displayed. How can I fix this script? Any help is greatly appreciated. I've been scouring the web and my books for days on how to fix this.
Upvotes: 1
Views: 9865
Reputation: 69171
As noted in the comments, the error is due to the way you've constructed your while loop. Because i is incremented before it is used as an index, at the last iteration i == 16, though there are only 15 elements to process. Changing from i <= vectorCount to i < vectorCount fixes the problem:
i <- 1
while( i < vectorCount ){
i <- i + 1
x <- data[i]
# if(x < outliers_left) {print(x)} # uncomment this to run and test for the left
if(x > outliers_right) {print(x)}
}
#-----
[1] 167
[1] 180
[1] 156
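One thing worth noting about the loop above: because i starts at 1 and is incremented before indexing, data[1] (the value 100) is never examined, which would explain why the left-side test prints nothing. A minimal sketch of a loop that visits every element, assuming the same fences as in the question (the name found is just illustrative):

```r
data <- c(100, 120, 121, 123, 125, 124, 123, 123, 123, 124, 125, 167, 180, 123, 156)

# Fences as defined in the question; unname() drops the "25%"/"75%" labels
outliers_left  <- unname(quantile(data, 0.25)) - 1.5 * IQR(data)
outliers_right <- unname(quantile(data, 0.75)) + 1.5 * IQR(data)

found <- c()  # illustrative accumulator, not in the original script
for (i in seq_along(data)) {   # seq_along() yields 1..15, never 16
  x <- data[i]
  if (x < outliers_left || x > outliers_right) {
    found <- c(found, x)
  }
}
print(found)
# [1] 100 167 180 156
```

seq_along(data) also behaves sensibly for an empty vector, where 1:length(data) would not.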
However, this is really not how R works, and you'll soon be frustrated at how long that code takes to run on data of any appreciable size. R is "vectorized", meaning that you can operate on all 15 elements of data at once. To print your outliers, I'd do this:
data[data > outliers_right]
#-----
[1] 167 180 156
Or to get all of them at once using the OR operator:
data[data < outliers_left | data > outliers_right]
#-----
[1] 100 167 180 156
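If you need this more than once, the vectorized filter can be wrapped in a small helper. This is only a sketch: the function name iqr_outliers and its k argument are made up for illustration, and dropping NA values is an assumption about what you would want.

```r
# Hypothetical helper: returns the values lying more than k * IQR
# outside the first/third quartiles. NA values are ignored.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]            # the interquartile range
  lower  <- q[1] - k * spread      # left fence
  upper  <- q[2] + k * spread      # right fence
  x[!is.na(x) & (x < lower | x > upper)]
}

data <- c(100, 120, 121, 123, 125, 124, 123, 123, 123, 124, 125, 167, 180, 123, 156)
iqr_outliers(data)
# [1] 100 167 180 156
```

Note the single | here: it compares element by element, whereas || is meant for a single TRUE/FALSE inside if().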
For a little context, the above logical comparisons create a boolean value for each element of data, and the subset returns only the elements for which that value is TRUE. You can check this for yourself by typing:
data > outliers_right
#----
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
The [ bit is actually an extraction operator, used to retrieve a subset of a data object. See the help page, ?"[", for some good background.
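As a rough illustration of what can be done with that logical vector, here is a short sketch. The variable name is_high is made up, and the fence is recomputed so the snippet runs on its own:

```r
data <- c(100, 120, 121, 123, 125, 124, 123, 123, 123, 124, 125, 167, 180, 123, 156)
outliers_right <- unname(quantile(data, 0.75)) + 1.5 * IQR(data)

is_high <- data > outliers_right   # one TRUE/FALSE per element

which(is_high)   # positions of the TRUE values
sum(is_high)     # TRUE counts as 1, so this is the number of high outliers
data[is_high]    # the usual subsetting syntax

# `[` is an ordinary function, so the same subset can be written as a call
`[`(data, is_high)
```

The last line is only there to show that [ is a function; in practice the bracket form is what you would write.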
Upvotes: 3
Reputation: 115392
The error message arises because you let i <= vectorCount, so i can equal vectorCount, and thus indexing data with i = i+1 will give NA, and the if statement will fail.
If you want to find the outliers based on the IQR, you can use findInterval with the two fences as break points:
outliers <- data[findInterval(data, c(outliers_left, outliers_right)) != 1]
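To see what findInterval is doing, it can help to look at the interval codes it returns. The sketch below assumes the 1.5 * IQR fences from the question are used as the break points (the quartiles themselves would instead flag everything outside the middle 50%); the name codes is illustrative.

```r
data <- c(100, 120, 121, 123, 125, 124, 123, 123, 123, 124, 125, 167, 180, 123, 156)
outliers_left  <- unname(quantile(data, 0.25)) - 1.5 * IQR(data)
outliers_right <- unname(quantile(data, 0.75)) + 1.5 * IQR(data)

# 0 = below the left fence, 1 = inside [left, right), 2 = at or above the right fence
codes <- findInterval(data, c(outliers_left, outliers_right))

data[codes == 0]   # low outliers
data[codes == 2]   # high outliers
data[codes != 1]   # both at once
```

Be aware that the intervals are closed on the left, so a value exactly equal to the right fence is counted as an outlier here, unlike with the strict data > outliers_right comparison.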
I would also stop using paste to create character messages to be printed; use message instead.
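A small sketch of what that might look like. The variable name iqr_value is my own choice; it also avoids giving a variable the same name as the IQR function:

```r
data <- c(100, 120, 121, 123, 125, 124, 123, 123, 123, 124, 125, 167, 180, 123, 156)
iqr_value <- IQR(data)

# message() pastes its arguments together with no separator,
# so the trailing space inside the string is needed
message("the interquartile range is ", iqr_value)
```

Unlike paste, which just returns a string, message writes to stderr and can be silenced with suppressMessages().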
Upvotes: 1