Lam Phan

Reputation: 3811

Why Ruby `num >> 1` is slower than `num/2`

I'm surprised that Ruby's `num >> 1` is slower than `num / 2`, since I always thought bit operators should be faster than ordinary arithmetic operators.

# ruby 3.0.3

require "benchmark/ips"

Benchmark.ips do |bm|
    num = 1000000

    bm.report("num/2") do
        num/2
    end

    bm.report("num >> 1") do
        num >> 1
    end

    bm.compare!
end
# num/2:        16045584.8 i/s
# num >> 1:     14591335.3 i/s - 1.10x  slower


Benchmark.ips do |bm|
    num = 1000000

    bm.report("num * 2") do
        num * 2
    end

    bm.report("num << 1") do
        num << 1
    end

    bm.compare!
end
# num * 2:      18252815.2 i/s
# num << 1:     14289700.6 i/s - 1.28x  slower

Upvotes: 3

Views: 346

Answers (1)

Todd A. Jacobs

Reputation: 84423

NB: Updated per Comments

Based on comments, I had failed to include the actual iterations in the testing. That has been corrected, and the code and results updated accordingly.

Mainline Ruby: 100 Million Iterations ≅ 23ns per Operation

I didn't test with benchmark/ips, as it is not a core gem or part of the Ruby standard library. However, based on some refactored tests using the standard Benchmark#bmbm method, some of your results may be skewed by the overhead of String allocation in certain Ruby versions: I noticed slightly more variance across multiple runs when not using frozen strings. I also got slightly more consistent results without YJIT enabled in C Ruby, since YJIT adds a variable amount of overhead even when its optimizations improve an application's overall speed. That said, on Ruby 3.3.0 I saw no material difference with or without frozen strings, and regardless of whether YJIT was enabled.
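If you want to confirm whether YJIT is actually active for a given run, CRuby exposes a small introspection API for it. A quick probe (this assumes a CRuby build that includes YJIT; the constant is simply absent on other builds and engines):

```ruby
# Sketch for CRuby: RubyVM::YJIT is only defined on builds that compile
# in YJIT, and .enabled? reports whether --yjit (or RUBY_YJIT_ENABLE)
# actually turned it on for this process.
yjit_on = defined?(RubyVM::YJIT) ? RubyVM::YJIT.enabled? : false
puts "YJIT active? #{yjit_on}"
```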

Consider the following, which uses frozen strings on Ruby 3.3.0:

# frozen_string_literal: true

require "benchmark"

NUM  = 1_000_000
ITER = 100_000_000

Benchmark.bmbm do |x|
    x.report("NUM / 2")  { ITER.times { NUM / 2 } }
    x.report("NUM >> 1") { ITER.times { NUM >> 1 } }

    x.report("NUM * 2")  { ITER.times { NUM * 2 } }
    x.report("NUM << 1") { ITER.times { NUM << 1 } }
end

On Ruby 3.3.0 with both YJIT and frozen strings enabled, running ruby --yjit shift_ops.rb consistently yields the following, plus or minus a few milliseconds:

Rehearsal --------------------------------------------
NUM / 2    2.172535   0.000201   2.172736 (  2.176693)
NUM >> 1   2.257154   0.000226   2.257380 (  2.261687)
NUM * 2    2.078395   0.000204   2.078599 (  2.081848)
NUM << 1   2.101787   0.000122   2.101909 (  2.104296)
----------------------------------- total: 8.610624sec

               user     system      total        real
NUM / 2    2.153104   0.000117   2.153221 (  2.156942)
NUM >> 1   2.296988   0.000197   2.297185 (  2.301160)
NUM * 2    2.114520   0.000157   2.114677 (  2.118208)
NUM << 1   2.121999   0.000158   2.122157 (  2.125518)

At 100 million iterations, each operation consistently takes less than 25 nanoseconds of wall-clock time. That seems fast enough to me for any practical purpose. Worrying about iterations per second when the total time varies by only about 0.18 seconds (roughly 1.8 nanoseconds per operation) between the fastest and slowest methods seems unreasonable for all but the most rigorous use cases. For more typical use cases, that minor delta is likely immaterial. However, you can do better with a different engine!
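As a quick sanity check on those numbers, you can compute the per-operation delta directly (the totals below are taken from the post-rehearsal run above):

```ruby
# Back-of-the-envelope math: spread between the fastest and slowest
# post-rehearsal totals, expressed per operation.
iterations = 100_000_000
fastest    = 2.114677 # NUM * 2 total seconds
slowest    = 2.297185 # NUM >> 1 total seconds

delta_ns_per_op = (slowest - fastest) / iterations * 1_000_000_000
puts format("delta: %.2f ns per operation", delta_ns_per_op)
# => delta: 1.83 ns per operation
```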

If you really need faster performance out of the reference Ruby implementation, consider using threads, ractors, or a multi-core approach that optimizes for concurrency. The parallel gem can help with that on mainline Ruby. Other engines may provide additional alternatives.
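As a rough illustration of the concurrency route (a hypothetical sketch rather than a tuned implementation, assuming Ruby 3.x where Ractors are still experimental and print a warning), the same work can be split across Ractors so CPU-bound arithmetic is not serialized by the global VM lock:

```ruby
# Hypothetical sketch: divide 4 million shift operations across four
# Ractors. Each Ractor sums its own chunk; the main Ractor combines them.
chunks    = 4
per_chunk = 1_000_000

ractors = chunks.times.map do |i|
  Ractor.new(i * per_chunk, per_chunk) do |start, count|
    (start...(start + count)).sum { |n| n >> 1 }
  end
end

total = ractors.sum(&:take)
puts total
```

Whether this pays off depends on chunk size; for work this cheap, the Ractor setup cost can easily dominate unless each chunk is large.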

Note: Optimizing IRB Sessions for Benchmarking

Note that if you're running inside IRB instead of calling a file, you should start IRB as:

RUBYOPT="--enable=frozen_string_literal" irb --nocolorize

to ensure that frozen strings are enabled by default and to avoid ANSI colors from impacting performance.
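To verify the flag took effect in the current process (a quick probe, nothing more), check whether a bare string literal is frozen:

```ruby
# With --enable=frozen_string_literal (or a frozen_string_literal: true
# magic comment), bare string literals are frozen; otherwise they are
# mutable.
literal_frozen = 'probe'.frozen?
puts "frozen string literals? #{literal_frozen}"
```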

TruffleRuby Performs Better on This Benchmark

It's worth noting that the code above ran faster on C Ruby than on JRuby or TruffleRuby when not looping. This is to be expected, since those engines optimize for different use cases and carry higher startup overhead, but it means the reference implementation is currently the fastest when you don't need to iterate extensively!

If you truly need to run millions of operations, TruffleRuby in native mode is blazingly fast. The same 100 million iterations on TruffleRuby 24.0.0, run as ruby shift_ops.rb without any other code changes or engine-specific optimizations, yields:

Rehearsal --------------------------------------------
NUM / 2    0.344032   0.009242   0.353274 (  0.265151)
NUM >> 1   0.303473   0.002182   0.305655 (  0.259223)
NUM * 2    0.372648   0.001641   0.374289 (  0.337829)
NUM << 1   0.355533   0.001281   0.356814 (  0.326600)
----------------------------------- total: 1.390032sec

               user     system      total        real
NUM / 2    0.353571   0.000581   0.354152 (  0.245380)
NUM >> 1   0.337815   0.000339   0.338154 (  0.253574)
NUM * 2    0.336717   0.000570   0.337287 (  0.246457)
NUM << 1   0.351651   0.000632   0.352283 (  0.246673)

That cuts the total time for 100 million operations down to around 0.25 seconds, or roughly 2.5 nanoseconds per operation, which is about nine times faster than C Ruby over a very large set of operations. On smaller sets, the absolute savings shrink to fractions of a millisecond, but someone will doubtless ding my math if I try to express the exact difference.

In other words, you can probably gain the performance improvements you want simply by using a different Ruby engine to perform the calculations. This is definitely worth considering.

Possible Optimization of the * and / Methods

I also found it notable that when testing with TruffleRuby across multiple runs, or when increasing the number of iterations, the shift operators often beat the multiplication and division operators during rehearsal, but almost never after warmup. While total times dropped during the post-rehearsal benchmarking run on both engines, over very large iteration counts the >> method frequently took noticeably longer than the / method, while << showed a much smaller difference compared to *. The same pattern held for C Ruby on most runs, but the deltas were always wider there.

My educated guess, although you'd have to look at the underlying VM or C code to be sure, is that * and / are more frequently used methods in Ruby code. Unlike in languages that specifically target systems programming, I would suspect that the multiplication and division methods have received more time, attention, and VM optimization than the binary shift methods. In particular, >> may simply be a more expensive call in Ruby for whatever reason.
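One way to probe that guess on C Ruby, assuming the CRuby-only RubyVM::InstructionSequence API, is to disassemble both expressions. On the CRuby versions I'm aware of, division compiles to the specialized opt_div bytecode instruction, while >> falls back to generic method dispatch:

```ruby
# CRuby-only sketch: compare the YARV bytecode emitted for / versus >>.
div_code   = RubyVM::InstructionSequence.compile("num = 1_000_000; num / 2").disasm
shift_code = RubyVM::InstructionSequence.compile("num = 1_000_000; num >> 1").disasm

puts div_code   # contains a specialized opt_div instruction
puts shift_code # dispatches >> as an ordinary method call
```

If the division shows opt_div while the shift shows only a generic send, that is consistent with the extra dispatch overhead seen in the timings above.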

That said, if raw performance is your goal, then you should use the most performant methods available based on benchmarking your real code. Regardless of the methods you decide to use, these results seem more than fast enough for most practical purposes, but they do highlight that the issue likely lies in the underlying VM or that engine's implementation of binary shifts, rather than in the Ruby language itself.

If You Still Think It's a Bug...

If you think it's a bug, as opposed to a language or implementation design decision, then you can always report it on the Ruby issue tracker. However, from a purely pragmatic point of view, all options are likely fast enough for most practical use cases. In addition, you always have the option of writing your own native C extension, or calling out from Ruby to a purpose-built tool, for number-crunching applications that really do need a higher rate of instructions per second.

Upvotes: 1
