Arun

Reputation: 118799

Why is subsetting on a "logical" type slower than subsetting on "numeric" type?

Suppose we have a vector (or a data.frame, for that matter) as follows:

set.seed(1)
x <- sample(10, 1e6, TRUE)

And one wants to get all values of x where x > 4, say:

a1 <- x[x > 4] # (or) 
a2 <- x[which(x > 4)]

identical(a1, a2) # TRUE

I think most people would prefer x[x > 4]. But surprisingly (at least to me), subsetting using which is faster!

require(microbenchmark)
microbenchmark(x[x > 4], x[which(x > 4)], times = 100)

Unit: milliseconds
            expr      min       lq   median       uq       max neval
        x[x > 4] 56.59467 57.70877 58.54111 59.94623 104.51472   100
 x[which(x > 4)] 26.62217 27.64490 28.31413 29.97908  99.68973   100

It's about 2.1 times faster on my machine.

One possibility for the difference, I thought, could be that which doesn't return NA positions whereas > keeps them. But then the logical operation itself should be the reason for this difference, which is obviously not the case. That is:

microbenchmark(x > 4, which(x > 4), times = 100)

Unit: milliseconds
         expr       min       lq   median       uq      max neval
        x > 4  8.182576 10.06163 12.68847 14.64203 60.83536   100
 which(x > 4) 18.579746 19.94923 21.43004 23.75860 64.20152   100

Computing the index with which is about 1.7 times slower than the logical comparison alone, yet which more than catches up during the subsetting itself.

It doesn't seem possible to use my usual weapon of choice, debugonce (thanks to @GavinSimpson), as which calls .Internal(which(x)) whereas > calls .Primitive(">").
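Printing the two functions shows why (output abbreviated; the exact bodies may differ slightly across R versions):

which
# function (x, arr.ind = FALSE, useNames = TRUE)
# {
#     wh <- .Internal(which(x))
#     ...

`>`
# function (e1, e2)  .Primitive(">")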

My question, therefore, is: why is [ faster on the numeric index produced by which than on the logical vector produced by >? Any ideas?

Upvotes: 14

Views: 1142

Answers (3)

asb

Reputation: 4432

I think I should move out of the comments and add an answer. This is my hunch, building on what the others have answered and discussed. (I'm sure the real answer exists in the C source for subset_dflt.)

Once I have a vector x and a logical vector x > 0, I can subset x on x > 0 in two ways. I can use which, or I can use the vector x > 0 directly as the index. However, we must note that the two are not identical, since x[x > 0] will preserve NAs while x[which(x > 0)] will not.
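A tiny hypothetical example (y is not part of the benchmark above, just an illustration of the NA behaviour):

y <- c(1, NA, 3, -2)
y[y > 0]         # logical subsetting keeps a slot for the NA comparison
# [1]  1 NA  3
y[which(y > 0)]  # which() drops the NA position entirely
# [1] 1 3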

However, in either method I will need to examine each element of the vector x > 0. In an explicit which call I only have to examine the boolean state of each element, while in a direct subsetting operation I have to examine both the missingness and the boolean state of each element.

@flodel brings an interesting observation. Since [, is.na, which, and | are all primitives or internal routines, let's assume no extraordinary overhead and do this experiment:

microbenchmark(which(x > 0), x[which(x > 0)], x > 0 | is.na(x), x[x > 0],
               unit="us", times=1000)

Unit: microseconds
             expr      min       lq   median       uq      max neval
     which(x > 0) 1219.274 1238.693 1261.439 1900.871 23085.57  1000
  x[which(x > 0)] 1554.857 1592.543 1974.370 2339.238 23816.99  1000
 x > 0 | is.na(x) 3439.191 3459.296 3770.260 4194.474 25234.70  1000
         x[x > 0] 3838.455 3876.816 4267.261 4621.544 25734.53  1000

Considering the median values, and assuming x > 0 | is.na(x) is a crude model of what I am saying happens in logical subsetting, the time actually spent in the 'subset' step is ~500 us, while the 'subset' step with which takes ~700 us. The two numbers are comparable and indicate that it is not the subsetting itself that is costly in one method or the other. Instead, it is what is being done to compute the wanted subset that is cheaper in the which method.
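As a rough sanity check on that arithmetic, here is a sketch that re-runs the same benchmark and differences the medians directly (the exact numbers will of course vary from run to run):

bm <- microbenchmark(which(x > 0), x[which(x > 0)], x > 0 | is.na(x), x[x > 0],
                     unit = "us", times = 1000)
s   <- summary(bm, unit = "us")
med <- setNames(s$median, s$expr)
med["x[which(x > 0)]"] - med["which(x > 0)"]   # ~700 us: the 'subset' step via which
med["x[x > 0]"] - med["x > 0 | is.na(x)"]      # ~500 us: the 'subset' step via logical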

Upvotes: 4

kohske

Reputation: 66842

This seems to be because subsetting by a logical vector is slower than subsetting by a numeric index.

> ii <- x > 4
> ij <- which(x > 4)
> 
> head(ii)
[1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
> head(ij)
[1] 3 4 6 7 8 9
>
> microbenchmark(x[ii], x[ij], times = 100)
Unit: milliseconds
  expr       min       lq    median        uq      max neval
 x[ii] 25.574977 26.15414 28.299858 31.080903 82.04686   100
 x[ij]  3.037134  3.31821  3.670096  7.516761 12.39738   100

Updated:

Probably one reason is that a shorter numeric index reduces the (internal) subsetting loop and therefore results in faster evaluation. You can see that length(ik) < length(ij) < length(il), which matches the ordering of the timings below (see also the quick length check after the benchmark).

But there must be another factor as well, because there is still a huge difference between x[ii] and x[il], even though il selects (nearly) every element.

> ii <- x > 4
> 
> ij <- which(x > 4)
> ik <- which(x > 9)
> il <- which(x > -1)
> 
> microbenchmark(x[ii], x[ij], x[ik], x[il], times = 100)
Unit: microseconds
  expr       min         lq    median        uq       max neval
 x[ii] 25645.621 25986.2720 28466.412 30693.158 79582.484   100
 x[ij]  3111.974  3281.8280  3477.627  6142.216 55076.121   100
 x[ik]   585.723   628.2125   650.184   682.888  7551.084   100
 x[il]  5266.032  5773.9015  9073.614 10583.312 15113.791   100
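A quick check of the index lengths (approximate, since they depend on the random draw) makes that ordering concrete:

c(ik = length(ik), ij = length(ij), il = length(il))
# roughly 1e5, 6e5 and 1e6 elements: length(ik) < length(ij) < length(il),
# matching the ordering of the x[ik], x[ij], x[il] timings above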

Upvotes: 3

Hong Ooi

Reputation: 57686

Here's my take on it. Subsetting on a numeric index allows pulling out exactly those elements that are required. Subsetting on a logical vector requires examining each element of the index vector to see if it's TRUE, and then building an internal list of the required elements of the target vector. There are two steps involved, so it will take longer.

The difference is biggest when the number of elements extracted is small relative to the size of the original vector. For example:

> z <- rnorm(1e8)
> system.time(z[which(z < -5)])
   user  system elapsed 
   0.58    0.03    0.60 
> system.time(z[z < -5])
   user  system elapsed 
   2.56    0.14    2.70 
> system.time(z[which(z < 5)])
   user  system elapsed 
   1.39    0.30    1.68 
> system.time(z[z < 5])
   user  system elapsed 
   2.82    0.44    3.26 

Here, if you're pulling out only a small proportion of the elements (there were 23 elements with z < -5 in my test), using which takes a small fraction of the time of logical indexing. However, if you're extracting a large proportion of the elements, the times are closer.
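To put a rough number on "small proportion" (no seed was set above, so counts will differ slightly between runs):

length(which(z < -5))  # only a handful of the 1e8 elements (23 in the run above)
mean(z < 5)            # very close to 1: almost every element gets extracted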

Upvotes: 3
