Reputation: 118799
Suppose we have a vector (or a data.frame, for that matter) as follows:
set.seed(1)
x <- sample(10, 1e6, TRUE)
And one wants to get all values of x where x > 4, say:
a1 <- x[x > 4] # (or)
a2 <- x[which(x > 4)]
identical(a1, a2) # TRUE
I think most people would prefer x[x > 4]. But surprisingly (at least to me), subsetting using which is faster!
require(microbenchmark)
microbenchmark(x[x > 4], x[which(x > 4)], times = 100)
Unit: milliseconds
expr min lq median uq max neval
x[x > 4] 56.59467 57.70877 58.54111 59.94623 104.51472 100
x[which(x > 4)] 26.62217 27.64490 28.31413 29.97908 99.68973 100
It's about 2.1 times faster on my machine.
One possibility for the difference, I thought, could be that which doesn't consider NA whereas > returns them as well. But if that were so, the logical operation itself should be the source of the difference, which is (obviously) not the case. That is:
microbenchmark(x > 4, which(x > 4), times = 100)
Unit: milliseconds
expr min lq median uq max neval
x > 4 8.182576 10.06163 12.68847 14.64203 60.83536 100
which(x > 4) 18.579746 19.94923 21.43004 23.75860 64.20152 100
Using which is about 1.7 times slower before subsetting, yet it seems to catch up drastically during the subsetting itself.
It doesn't seem possible to use my usual weapon of choice, debugonce (thanks to @GavinSimpson), as which calls .Internal(which(x)) whereas == calls .Primitive("==").
My question, therefore, is: why is subsetting with [ using the numeric index resulting from which faster than using the logical vector resulting from >? Any ideas?
Upvotes: 14
Views: 1142
Reputation: 4432
I think I should move out of the comments and add an answer. This is my hunch, building on what the others have answered and discussed. (I'm sure the real answer exists in the C source for subset_dflt.)
Once I have a vector x and a logical vector x > 0, I can subset x on x > 0 in two ways: I can use which, or I can use the logical vector x > 0 directly as the index. However, we must note that the two are not identical, since x[x > 0] will preserve NAs while x[which(x > 0)] will not.
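A minimal illustration of that difference, using a small throwaway vector y (not the x from the question):

y <- c(-1, 2, NA, 5)
y[y > 0]          # logical indexing keeps the NA: 2 NA 5
y[which(y > 0)]   # which() drops it: 2 5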
However, in either method, I will need to examine each element of the vector x > 0. In an explicit which call I only have to examine the boolean state of each element, while in a direct subsetting operation I have to examine both the missingness and the boolean state of each element.
@flodel brings up an interesting observation. Since [, is.na, which, and | are all primitives or internal routines, let's assume no extraordinary overhead and do this experiment:
microbenchmark(which(x > 0), x[which(x > 0)], x > 0 | is.na(x), x[x > 0],
unit="us", times=1000)
Unit: microseconds
expr min lq median uq max neval
which(x > 0) 1219.274 1238.693 1261.439 1900.871 23085.57 1000
x[which(x > 0)] 1554.857 1592.543 1974.370 2339.238 23816.99 1000
x > 0 | is.na(x) 3439.191 3459.296 3770.260 4194.474 25234.70 1000
x[x > 0] 3838.455 3876.816 4267.261 4621.544 25734.53 1000
Considering the median values, and taking x > 0 | is.na(x) as a crude model of the work that logical subsetting has to do, the time spent in the actual 'subset' step is roughly 4267 - 3770 ≈ 500 us for logical indexing, and roughly 1974 - 1261 ≈ 700 us for the which method. The two numbers are comparable, which indicates that it is not the subsetting itself that is costly in one method or the other. Instead, it is the work done to compute the wanted subset that is cheaper in the which method.
Upvotes: 4
Reputation: 66842
This seems to be because subsetting by a logical vector is slower than subsetting by a numerical index.
> ii <- x > 4
> ij <- which(x > 4)
>
> head(ii)
[1] FALSE FALSE TRUE TRUE FALSE TRUE
> head(ij)
[1] 3 4 6 7 8 9
>
> microbenchmark(x[ii], x[ij], times = 100)
Unit: milliseconds
expr min lq median uq max neval
x[ii] 25.574977 26.15414 28.299858 31.080903 82.04686 100
x[ij] 3.037134 3.31821 3.670096 7.516761 12.39738 100
Updated:
Probably one reason is that a smaller numeric index reduces the (internal) loop for subsetting and results in faster evaluation: in the timings below you can see x[ik] < x[ij] < x[il], which matches the lengths of the three indices (see the quick length check after the benchmark). But there must be another difference as well, because there is still a huge gap between x[ii] and x[il], even though il selects every element of x.
> ii <- x > 4
>
> ij <- which(x > 4)
> ik <- which(x > 9)
> il <- which(x > -1)
>
> microbenchmark(x[ii], x[ij], x[ik], x[il], times = 100)
Unit: microseconds
expr min lq median uq max neval
x[ii] 25645.621 25986.2720 28466.412 30693.158 79582.484 100
x[ij] 3111.974 3281.8280 3477.627 6142.216 55076.121 100
x[ik] 585.723 628.2125 650.184 682.888 7551.084 100
x[il] 5266.032 5773.9015 9073.614 10583.312 15113.791 100
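For reference, a quick length check of the indices (the exact counts depend on the sample, so take them as rough figures):

length(ii)   # 1e6: the logical index always has the full length of x
length(ij)   # roughly 6e5 (values 5..10 out of 1..10)
length(ik)   # roughly 1e5 (only the value 10)
length(il)   # exactly 1e6 (every value is > -1)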
Upvotes: 3
Reputation: 57686
Here's my take on it. Subsetting on a numeric index allows pulling out exactly those elements that are required. Subsetting on a logical vector requires examining each element of the index vector to see if it's TRUE, and then building an internal list of the required elements of the target vector. There are two steps involved, so it takes longer.
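A rough R-level sketch of that idea (purely illustrative stand-ins, not what the C code behind [ actually does; the function names are made up):

# Hypothetical stand-ins for the two subsetting strategies.
subset_by_integer <- function(x, idx) {
  out <- numeric(length(idx))
  for (k in seq_along(idx)) {        # touch only the selected elements
    out[k] <- x[idx[k]]
  }
  out
}

subset_by_logical <- function(x, lgl) {
  out <- numeric(0)
  for (k in seq_along(lgl)) {        # must scan every element of the mask
    if (is.na(lgl[k]) || lgl[k]) {   # an NA in the mask yields NA in the result
      out <- c(out, if (is.na(lgl[k])) NA else x[k])
    }
  }
  out
}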
The difference is biggest when the number of elements extracted is small relative to the size of the original vector. For example:
> z <- rnorm(1e8)
> system.time(z[which(z < -5)])
user system elapsed
0.58 0.03 0.60
> system.time(z[z < -5])
user system elapsed
2.56 0.14 2.70
> system.time(z[which(z < 5)])
user system elapsed
1.39 0.30 1.68
> system.time(z[z < 5])
user system elapsed
2.82 0.44 3.26
Here, if you're pulling out only a small proportion of elements (there were 23 elements with z < -5 in my test), using which takes a very small fraction of the time of logical indexing. However, if you're extracting a large proportion of the elements, the times are much closer.
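A quick way to see how lopsided the two selections are (with standard normal z, only about 3 in 10 million values fall below -5, while essentially all of them fall below 5):

sum(z < -5)    # a handful of elements (roughly 30 out of 1e8)
mean(z < 5)    # essentially 1: almost every element is selected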
Upvotes: 3