Reputation: 6784
I am using %in% for subsetting and I came across a strange result.
> my.data[my.data$V3 %in% seq(200,210,.01),]
V1 V2 V3 V4 V5 V6 V7
56 470 48.7 209.73 yes 26.3 54 470
That was correct. But when I widen the range, row 56 just disappears:
> my.data[my.data$V3 %in% seq(150,210,.01),]
V1 V2 V3 V4 V5 V6 V7
51 458 48.7 156.19 yes 28.2 58 458
67 511 30.5 150.54 yes 26.1 86 511
73 535 40.6 178.76 yes 29.5 73 535
Can you tell me what's wrong? Is there a better way to subset the dataframe?
Here is its structure
> str(my.data)
'data.frame': 91 obs. of 7 variables:
$ V1: Factor w/ 91 levels "100","10004",..: 1 2 3 4 5 6 7 8 9 10 ...
$ V2: num 44.6 22.3 30.4 38.6 15.2 18.3 16.3 12.2 36.7 12.2 ...
$ V3: num 110.83 25.03 17.17 57.23 2.18 ...
$ V4: Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 1 1 ...
$ V5: num 22.3 30.5 24.4 25.5 4.1 28.4 7.9 5.1 24 12.2 ...
$ V6: int 50 137 80 66 27 155 48 42 65 100 ...
$ V7: chr "" "10004" "10005" "10012" ...
Upvotes: 3
Views: 588
Reputation: 174778
Ooops. You are trying to do exact matching on a computer that can't represent all numbers exactly.
> any(209.73 == seq(200,210,.01))
[1] TRUE
> any(209.73 == seq(150,210,.01))
[1] FALSE
> any(209.73 == zapsmall(seq(150,210,.01)))
[1] TRUE
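The reason for the discrepancy is that in the second sequence the computed value is not exactly 209.73. seq() builds each element by floating-point arithmetic from the start value, so the same nominal number can round differently when the sequence starts at 150 rather than 200. This is something you have to appreciate when doing computation with computers.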
This is covered in many places on the interweb, but in relation to R, see point 7.31 in the R FAQ ("Why doesn't R think these numbers are equal?").
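If you ever do need to match against a set of numeric values, the usual workaround is to compare within a tolerance rather than exactly. Here is a minimal sketch of that idea; near_in() is a hypothetical helper written for this example (it is not part of base R), and the tolerance of 1e-8 is an arbitrary assumption you would tune to your data.

```r
# Decimal fractions such as 0.1 have no exact binary representation,
# so a result that "should" equal 0.3 can be off by a tiny amount.
x <- 0.1 + 0.2
x == 0.3                     # FALSE
isTRUE(all.equal(x, 0.3))    # TRUE: all.equal() compares within a tolerance

# Hypothetical helper (not base R): like %in%, but with a tolerance.
# For each element of x, check whether any value in table is within tol.
near_in <- function(x, table, tol = 1e-8) {
  vapply(x, function(v) any(abs(table - v) < tol), logical(1))
}

near_in(209.73, seq(150, 210, 0.01))   # TRUE
```

This does one pass over the table for every element of x, so it is fine for a quick check but slow for large inputs.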
Anyway, that said, you are going about the problem incorrectly. You are selecting a range, not matching a list of values, so use the comparison operators, which test the bounds directly:
my.data[my.data$V3 >= 150 & my.data$V3 <= 210, ]
## or
subset(my.data, V3 >= 150 & V3 <= 210)
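One difference between the two forms above shows up when V3 contains missing values. The sketch below uses a made-up data frame called toy; its V3 values are borrowed from rows shown in the question, and the NA row (with the invented id "999") is an assumption added purely for illustration.

```r
# Toy data frame for illustration only; the last row has a missing V3.
toy <- data.frame(
  V1 = c("470", "458", "511", "535", "100", "999"),
  V3 = c(209.73, 156.19, 150.54, 178.76, 110.83, NA)
)

# Logical index: TRUE/FALSE per row, NA where V3 is missing.
in_range <- toy$V3 >= 150 & toy$V3 <= 210

nrow(toy[in_range, ])                      # 5: the NA yields an extra all-NA row
nrow(toy[which(in_range), ])               # 4: which() keeps only the TRUE positions
nrow(subset(toy, V3 >= 150 & V3 <= 210))   # 4: subset() also drops the NA row
```

So if your column may contain NA, either wrap the condition in which() or use subset().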
Upvotes: 8