Reputation: 44807
I have billions of unsigned 64-bit numbers (x) for which I need to find x % m
. m
is the same for all, but is not a compile-time constant. m
is greater than one.
Both x
and m
are equally likely to hold any 64-bit value.
(If m
were a compile-time constant, I could just trust the compiler to do its best.)
As m
is the same for all of these operations, and I have storage available, is there an optimisation which will let me calculate all the results for x % m
faster than if m
were different for each x
?
(I'm inspired by the division-by-a-constant optimisation which transforms division into multiplication, shifts and addition. Illustrative code in C, Rust or pseudo-code would be great.)
I have looked at the document, but their optimisation of calculating modulus without calculating the quotient only applies to divisors of 2^n +/- 1
.
Upvotes: 6
Views: 294
Reputation: 44807
Below is an approach that I will test, as the comments on the question have made it plain that the distribution of x
and m
are important.
What I hadn't realised is that it is 50% likely that m > x
.
Untested pseudo-code:
fn mod(x: u64, m: u64) -> u64 {
if m > x {
x
} else if 2*m > x {
x - m
} else {
x % m
}
}
This may well be what the compiler produces. I had not tested.
Update: I've tested this optimisation on my i5 2500K, processing 1 GiB of data.
The result is a stable 67% speed increase:
opt: 713.113926ms
plain: 1.195687298s
The benchmarking code is:
use std::time::Instant;
use rand::Rng;
const ARR_LEN: usize = 128 * 1024;
const ROUNDS: usize = 1024;
fn plain_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
for i in 0..ARR_LEN {
result[i] = x[i] % m;
}
}
fn opt_mod_test(x: &[u64; ARR_LEN], m: u64, result: &mut [u64; ARR_LEN]) {
for i in 0..ARR_LEN {
result[i] = if m > x[i] {
x[i]
} else if m > x[i] / 2 {
x[i] - m
} else {
x[i] % m
}
}
}
fn main() {
// 1 MiB of pseudo-random values x
let mut rng = rand::thread_rng();
let mut x = [0u64; ARR_LEN];
for i in 0..ARR_LEN {
x[i] = rng.gen();
}
// 1 KiB of pseudo-random modulii m
let mut m = [0u64; ROUNDS];
for r in 0..ROUNDS {
m[r] = rng.gen(); // there's only a 1 in 2^64 chance that 0 will be generated
}
// 1 GiB of output each, use Vec to avoid stack overflow
let mut plain_results = vec![[0u64; ARR_LEN]; ROUNDS];
let mut opt_results = vec![[0u64; ARR_LEN]; ROUNDS];
// These loops modulus 1GB of data each.
let start_opt = Instant::now();
for r in 0..ROUNDS {
opt_mod_test(&x, m[r], &mut opt_results[r]);
}
let stop_opt = Instant::now();
println!("opt: {:?}", stop_opt - start_opt);
let start_plain = Instant::now();
for r in 0..ROUNDS {
plain_mod_test(&x, m[r], &mut plain_results[r]);
}
let stop_plain = Instant::now();
println!("plain: {:?}", stop_plain - start_plain);
// Stop the results from being optimised away, by using them.
let mut plain_sum = 0;
let mut opt_sum = 0;
for r in 0..ROUNDS {
for i in 0..ARR_LEN {
plain_sum += plain_results[r][i];
opt_sum += opt_results[r][i];
}
}
println!("opt_sum: {:?}", opt_sum);
println!("plain_sum: {:?}", plain_sum);
}
Upvotes: 1