Reputation: 4880
Playing with Rust iterators and lambdas, I've implemented an N prime numbers generator (original playground / improved code).
It performs very poorly (10 times slower) compared to equivalent functional code in another language I'm more proficient with, on the same box.
Some numbers:
test calc_primes ... bench: 7,754,795 ns/iter (+/- 1,887,591)
#[macro_use]
extern crate bencher;
use bencher::Bencher;
use nth_prime::*;
fn calc_primes(b: &mut Bencher) {
b.iter(|| primes(10000));
}
benchmark_group!(benches, calc_primes);
benchmark_main!(benches);
I expect Rust to outperform it, but I struggle to get rid of the flaws in the code below. So here's a call for help optimising this particular implementation.
I suspect the Vec allocations; some are surely not needed. OR-ing onto the fvec bitarray would be super quick. The logic in the Rust below expresses the same approach in the fold call, but it uses Vec<bool>, which must be slow versus a bitarray equivalent. Instead of Vec<bool> I wouldn't mind allocating a massive byte array and operating on it with bit-shifts etc., but I can't imagine putting that together.
pub fn primes(n: u64) -> Vec<usize> {
if n < 4 {
vec![2]
} else {
base(primes((n as f64).sqrt().ceil() as u64), n)
}
}
fn base(r: Vec<usize>, p: u64) -> Vec<usize> {
let fvec = (0..p).map(|_| false).collect::<Vec<bool>>();
let z = r
.iter()
.map(|&x| (0..x).map(|y| y > 0).collect::<Vec<bool>>())
.map(|x| {
(0..p)
.map(|n| !x[(n % (x.len() as u64)) as usize])
.collect::<Vec<bool>>()
})
.fold(fvec, |acc, x| {
acc.iter().zip(x).map(|(a, b)| *a || b).collect()
});
let yy: Vec<usize> = z
.iter()
.enumerate()
.filter_map(|(i, x)| if !*x { Some(i) } else { None })
.skip(1)
.collect();
r.iter().chain(yy.iter()).map(|x| *x).collect()
}
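Incidentally, the massive-byte-array-with-bit-shifts idea mentioned above takes only a handful of lines of safe Rust. Here's a hypothetical sketch (a plain sieve over a Vec<u64> bitset, not the fold-based approach from the question):

```rust
// One bit per candidate number: bit (i % 64) of words[i / 64] set means i is composite.
pub fn primes_bitset(n: usize) -> Vec<usize> {
    let mut words = vec![0u64; n / 64 + 1];
    let mut p = 2;
    while p * p < n {
        // only sieve from numbers still marked prime
        if (words[p >> 6] & (1u64 << (p & 63))) == 0 {
            let mut m = p * p;
            while m < n {
                words[m >> 6] |= 1u64 << (m & 63);
                m += p;
            }
        }
        p += 1;
    }
    (2..n)
        .filter(|&i| (words[i >> 6] & (1u64 << (i & 63))) == 0)
        .collect()
}

fn main() {
    println!("{:?}", primes_bitset(30)); // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
}
```

The shift-and-mask pair (i >> 6, i & 63) is the whole trick: it replaces a Vec<bool> index with a word index plus a bit offset.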
Regarding prime number generation algorithms, there's an outstanding answer in this thread, so this is not a question about optimising the algorithm, but about this particular implementation, without external crates (it's a learning exercise).
EDIT:
After some hints from the comments section, I got it significantly optimised:
test calc_primes ... bench: 4,776,400 ns/iter (+/- 1,427,534)
test calc_primes_sieve ... bench: 644,220 ns/iter (+/- 1,655,581)
test calc_primes_2 ... bench: 268,598 ns/iter (+/- 46,440)
Replaced the Vec<bool> fold bitmap with the idiomatic iterator way of doing it: step_by.
fn base2(head: Vec<u64>, p: u64) -> Vec<u64> {
let mut fvec = (0..p).map(|_| false).collect::<Vec<bool>>();
head.iter()
.map(|&x| {
(0..p)
.step_by(x as usize)
.for_each(|y| fvec[y as usize] = true)
})
.for_each(drop);
let tail: Vec<u64> = fvec.iter().enumerate()
.filter_map(|(i, x)| if !*x { Some(i as u64) } else { None })
.skip(1)
.collect();
head.iter().chain(tail.iter()).cloned().collect()
}
EDIT 2:
For reference, here's the 'other' language implementation:
q)p:{$[x<4; enlist 2; r, 1_ where not any x#'not til each r:p ceiling sqrt x]}
q)\ts do[1000;p 10000]
474 443088
= 474,000 ns/iter
For 64bit q v4.0:
q)\t do[1000;p 10000]
273
= 273,000 ns/iter
Not too bad for an interpreted language.
EDIT 3
Added more benchmarks for new implementations from the answers (as they appeared, in chronological order).
Was able to shave off a tad more with no skip(1) and a different fvec allocation.
New playground code.
test ssqrt_p_1__vec_bool_n_mod ... bench: 6,838,150 ns/iter (+/- 627,389)
test sieve_p_2__mut_step_by ... bench: 367,229 ns/iter (+/- 38,279)
test ssqrt_p_3__mut_step_by ... bench: 266,189 ns/iter (+/- 56,215)
test ssqrt_p_4__push_step_by_mod ... bench: 1,255,206 ns/iter (+/- 262,107)
test sieve_p_5__for_loop_mut ... bench: 441,397 ns/iter (+/- 47,077)
test ssqrt_p_6__no_skip ... bench: 176,186 ns/iter (+/- 24,765)
StdDev plays a big role now... running with:
taskset -c 0 cargo bench --jobs 1
fn base3(head: Vec<u64>, p: u64) -> Vec<u64> {
let mut fvec = vec![false; p as usize];
fvec[1] = true;
head.iter()
.map(|&x| {
(0..p)
.step_by(x as usize)
.for_each(|y| fvec[y as usize] = true)
})
.for_each(drop);
let tail: Vec<u64> = fvec
.iter()
.enumerate()
.filter_map(|(i, x)| if !*x { Some(i as u64) } else { None })
.collect();
head.iter().chain(tail.iter()).cloned().collect()
}
EDIT 5
Surprisingly, the no_chain version below is slower than the base3 no_skip version. I guess that's compiler magic (tail call optimisation in recursive functions???).
fn base4(head: Vec<u64>, p: u64) -> Vec<u64> {
let mut fvec = vec![false; p as usize];
fvec[1] = true;
head.iter()
.map(|&x| {
(0..p)
.step_by(x as usize)
.for_each(|y| fvec[y as usize] = true);
fvec[x as usize] = false;
})
.for_each(drop);
fvec.iter().enumerate()
.filter_map(|(i, x)| if !*x { Some(i as u64) } else { None })
.collect()
}
Upvotes: 3
Views: 966
Reputation: 2026
One thing to note is that the performance of different methods might change with input size. It also might not; the point is that benchmarking with only one input size might yield results that don't generalize to all sizes.
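With the bencher harness already used in the question, covering several sizes just means registering one benchmark function per size. This is a sketch: the size choices are arbitrary, and nth_prime::primes stands in for whichever implementation is being measured.

```rust
#[macro_use]
extern crate bencher;
use bencher::Bencher;
use nth_prime::primes;

// One benchmark per input size, so scaling behaviour becomes visible in the report.
fn calc_primes_1k(b: &mut Bencher) { b.iter(|| primes(1_000)); }
fn calc_primes_10k(b: &mut Bencher) { b.iter(|| primes(10_000)); }
fn calc_primes_100k(b: &mut Bencher) { b.iter(|| primes(100_000)); }

benchmark_group!(benches, calc_primes_1k, calc_primes_10k, calc_primes_100k);
benchmark_main!(benches);
```

Run with cargo bench; comparing the three rows per implementation shows whether a ranking at n=10,000 still holds at the other sizes.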
Here's the original code in the question:
pub fn primes(n: u64) -> Vec<usize> {
if n < 4 {
vec![2]
} else {
base(primes((n as f64).sqrt().ceil() as u64), n)
}
}
fn base(r: Vec<usize>, p: u64) -> Vec<usize> {
let fvec = (0..p).map(|_| false).collect::<Vec<bool>>();
let z = r
.iter()
.map(|&x| (0..x).map(|y| y > 0).collect::<Vec<bool>>())
.map(|x| {
(0..p)
.map(|n| !x[(n % (x.len() as u64)) as usize])
.collect::<Vec<bool>>()
})
.fold(fvec, |acc, x| {
acc.iter().zip(x).map(|(a, b)| *a || b).collect()
});
let yy: Vec<usize> = z
.iter()
.enumerate()
.filter_map(|(i, x)| if !*x { Some(i) } else { None })
.skip(1)
.collect();
r.iter().chain(yy.iter()).map(|x| *x).collect()
}
I found the above code rather tricky to read and understand. It seems to have a lot of unnecessary noise, like unneeded collects. Additionally, the variable names could be a lot more descriptive.
Below is my naive implementation of a sieve (calc_primes_mine_naive in the benchmarks below). It has no mutability, though admittedly it has an unnecessary collect. It is 2x to 3x faster than the above.
pub fn primes_less_than(n: u64) -> Vec<usize> {
    let is_not_prime: Vec<bool> = (0..n).map(|num| {
        // trial-divide by 2..=floor(sqrt(num)); 0 and 1 are not prime
        let sieve_upper_bound = (num as f64).sqrt() as u64 + 1;
        num < 2 || (2..sieve_upper_bound).any(|divisor| {
            (num % divisor) == 0
        })
    }).collect();
    is_not_prime.iter().enumerate().filter(|(_, this_one_not_prime)| {
        !**this_one_not_prime
    }).map(|(definitely_prime, _)| { definitely_prime }).collect()
}
Below is the first attempt at optimizing. Generally, sqrts are slow, so getting rid of the repeated sqrts above should be an absolute win (it didn't seem to make much of a difference in practice).
pub fn primes_less_than_naive(n: u64) -> Vec<usize> {
    let mut sqrt = 1;
    let is_not_prime: Vec<bool> = (0..n).map(|num| {
        // maintain sqrt as floor(sqrt(num)) + 1 incrementally, no float ops
        if sqrt * sqrt <= num {
            sqrt += 1;
        }
        let sieve_upper_bound = sqrt;
        num < 2 || (2..sieve_upper_bound).any(|divisor| {
            (num % divisor) == 0
        })
    }).collect();
    is_not_prime.iter().enumerate().filter(|(_, this_one_not_prime)| {
        !**this_one_not_prime
    }).map(|(definitely_prime, _)| { definitely_prime }).collect()
}
I looked at the assembly and profiled it in VTune. VTune was complaining about "high MS switch" overhead. I initially assumed this was caused by casts to and from floats, but it continued after all float operations were removed. Looking at the assembly seemed to imply that there might still be some overhead due to collect, so I got rid of the extra collect.
pub fn primes_less_than_naive2(n: u64) -> Vec<usize> {
    let mut sqrt = 1;
    (0..n).map(|num| {
        if sqrt * sqrt <= num {
            sqrt += 1;
        }
        let sieve_upper_bound = sqrt;
        num < 2 || (2..sieve_upper_bound).any(|divisor| {
            (num % divisor) == 0
        })
    }).enumerate().filter(|(_, this_one_not_prime)| {
        !*this_one_not_prime
    }).map(|(definitely_prime, _)| { definitely_prime }).collect()
}
Having no mutability seems to have hurt performance a fair bit. There may be a way to express this without using mut, but it would probably be rather unwieldy code.
pub fn primes_less_than_naive_with_mut(n: usize) -> Vec<usize> {
    let mut is_not_prime = vec![false; n];
    is_not_prime[0] = true;
    is_not_prime[1] = true;
    let sieve_upper_bound = (n as f64).sqrt().ceil() as usize + 1;
    (2..sieve_upper_bound).for_each(|step_size| {
        // start at 2 * step_size so step_size itself is not marked composite
        (step_size * 2..n).step_by(step_size).for_each(|i| {
            is_not_prime[i] = true;
        })
    });
    is_not_prime.iter().enumerate().filter(|(_, this_one_not_prime)| {
        !**this_one_not_prime
    }).map(|(definitely_prime, _)| { definitely_prime }).collect()
}
At this point VTune complains that some parts of the above are core-bound and some parts are memory-bound. This is annoying, because the way to fix the memory-bound issue is to use bitvecs, and the way to fix core-boundness is to not use bitvecs. In an attempt to alleviate the memory-boundness I tried alternating the loop direction, the idea being that this would make better use of the cache. When I first wrote this I found some evidence that it worked, but since then I haven't.
pub fn primes_less_than_naive_alternating(n: usize) -> Vec<usize> {
    let mut is_not_prime = vec![false; n];
    is_not_prime[0] = true;
    is_not_prime[1] = true;
    let sieve_upper_bound = (n as f64).sqrt().ceil() as usize + 1;
    let mut forward = true;
    (2..sieve_upper_bound).for_each(|step_size| {
        if forward {
            (step_size * 2..n).step_by(step_size).for_each(|i| {
                is_not_prime[i] = true;
            })
        } else {
            (step_size * 2..n).step_by(step_size).rev().for_each(|i| {
                is_not_prime[i] = true;
            })
        };
        forward = !forward;
    });
    is_not_prime.iter().enumerate().filter(|(_, this_one_not_prime)| {
        !**this_one_not_prime
    }).map(|(definitely_prime, _)| { definitely_prime }).collect()
}
Bitvec version. This is slower (unsurprisingly). It may be possible to make it fast by having some kind of mask lookup table for small step_size, but I'm not optimistic. VTune has nothing interesting to say except that there might be lots of data dependencies messing this up (i.e. it can't execute out of order well, because the next byte depends on the previous byte, because behind the scenes the loads and stores are word-sized?).
pub unsafe fn primes_less_than_naive_bitvec(n: usize) -> Vec<usize> {
    let num_bytes = n / 8 + 1;
    let is_not_prime = libc::calloc(size_of::<u8>(), num_bytes) as *mut u8;
    *is_not_prime |= 0b11; // bits 0 and 1: 0 and 1 are not prime
    let sieve_upper_bound = (n as f64).sqrt().ceil() as usize + 1;
    (2..sieve_upper_bound).for_each(|step_size| {
        (step_size * 2..n).step_by(step_size).for_each(|i| {
            let address = i >> 3;
            let bit_within_the_byte = i & 0b111;
            let cur = is_not_prime.offset(address as isize);
            *cur |= (0x1 as u8) << bit_within_the_byte;
        });
    });
    let primes = (0..n).filter(|i| {
        let address = i / 8;
        let bit_within_the_byte = i % 8;
        let cur = is_not_prime.offset(address as isize).read();
        // test the bit with &, not |, otherwise every number looks composite
        (cur & ((0x1 as u8) << bit_within_the_byte)) == 0
    }).collect::<Vec<usize>>();
    libc::free(is_not_prime as *mut libc::c_void);
    primes
}
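For comparison, the same bitmap idea is expressible in safe Rust with a Vec<u8>, no libc and no manual free. This is a sketch (not the answer's benchmarked code), with the marking loop bounded so that only 0, 1 and true multiples get flagged:

```rust
// Safe-Rust byte-array bitmap sieve: bit (i % 8) of bits[i / 8] marks i composite.
pub fn primes_less_than_bitvec_safe(n: usize) -> Vec<usize> {
    let mut bits = vec![0u8; n / 8 + 1];
    bits[0] = 0b11; // 0 and 1 are not prime
    let sieve_upper_bound = (n as f64).sqrt().ceil() as usize + 1;
    for step_size in 2..sieve_upper_bound {
        // start at 2 * step_size so step_size itself survives
        for i in (step_size * 2..n).step_by(step_size) {
            bits[i >> 3] |= 1 << (i & 0b111);
        }
    }
    (0..n)
        .filter(|&i| (bits[i >> 3] & (1 << (i & 0b111))) == 0)
        .collect()
}

fn main() {
    println!("{:?}", primes_less_than_bitvec_safe(30)); // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
}
```

The bounds checks on bits[...] are what the libc version avoids; whether that matters is exactly the kind of thing the benchmarks above would have to decide.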
At n=10:
test calc_primes ... bench: 752 ns/iter (+/- 61)
test calc_primes_mine ... bench: 314 ns/iter (+/- 20)
test calc_primes_mine_alternating_loop_direction ... bench: 110 ns/iter (+/- 6)
test calc_primes_mine_bitvec ... bench: 165 ns/iter (+/- 888)
test calc_primes_mine_naive ... bench: 304 ns/iter (+/- 29)
test calc_primes_mine_naive2 ... bench: 250 ns/iter (+/- 16)
test calc_primes_mine_naive_with_mut ... bench: 70 ns/iter (+/- 4)
At n=100:
test calc_primes ... bench: 5,924 ns/iter (+/- 2,210)
test calc_primes_mine ... bench: 3,384 ns/iter (+/- 88)
test calc_primes_mine_alternating_loop_direction ... bench: 673 ns/iter (+/- 23)
test calc_primes_mine_bitvec ... bench: 1,194 ns/iter (+/- 36)
test calc_primes_mine_naive ... bench: 3,141 ns/iter (+/- 195)
test calc_primes_mine_naive2 ... bench: 2,919 ns/iter (+/- 107)
test calc_primes_mine_naive_with_mut ... bench: 521 ns/iter (+/- 31)
At n=1000:
test calc_primes ... bench: 120,828 ns/iter (+/- 18,018)
test calc_primes_mine ... bench: 59,269 ns/iter (+/- 6,993)
test calc_primes_mine_alternating_loop_direction ... bench: 7,080 ns/iter (+/- 2,795)
test calc_primes_mine_bitvec ... bench: 16,186 ns/iter (+/- 470)
test calc_primes_mine_naive ... bench: 56,572 ns/iter (+/- 1,849)
test calc_primes_mine_naive2 ... bench: 54,526 ns/iter (+/- 9,773)
test calc_primes_mine_naive_with_mut ... bench: 5,873 ns/iter (+/- 531)
At n=10,000
test calc_primes ... bench: 2,672,089 ns/iter (+/- 117,553)
test calc_primes_mine ... bench: 1,222,262 ns/iter (+/- 24,242)
test calc_primes_mine_alternating_loop_direction ... bench: 77,348 ns/iter (+/- 4,845)
test calc_primes_mine_bitvec ... bench: 201,894 ns/iter (+/- 4,446)
test calc_primes_mine_naive ... bench: 1,213,264 ns/iter (+/- 160,097)
test calc_primes_mine_naive2 ... bench: 1,174,753 ns/iter (+/- 88,116)
test calc_primes_mine_naive_with_mut ... bench: 66,283 ns/iter (+/- 6,358)
At n=100,000:
test calc_primes ... bench: 71,953,394 ns/iter (+/- 11,795,923)
test calc_primes_mine ... bench: 27,542,006 ns/iter (+/- 5,158,828)
test calc_primes_mine_alternating_loop_direction ... bench: 1,103,863 ns/iter (+/- 135,770)
test calc_primes_mine_bitvec ... bench: 2,547,044 ns/iter (+/- 180,308)
test calc_primes_mine_naive ... bench: 27,399,779 ns/iter (+/- 3,121,742)
test calc_primes_mine_naive2 ... bench: 26,430,825 ns/iter (+/- 1,818,277)
test calc_primes_mine_naive_with_mut ... bench: 1,045,803 ns/iter (+/- 103,090)
At n=1000000:
test calc_primes ... bench: 1,827,068,877 ns/iter (+/- 131,368,352)
test calc_primes_mine ... bench: 658,940,900 ns/iter (+/- 45,559,627)
test calc_primes_mine_alternating_loop_direction ... bench: 18,112,057 ns/iter (+/- 1,839,472)
test calc_primes_mine_bitvec ... bench: 30,189,983 ns/iter (+/- 2,649,749)
test calc_primes_mine_naive ... bench: 658,018,301 ns/iter (+/- 39,323,294)
test calc_primes_mine_naive2 ... bench: 651,249,390 ns/iter (+/- 22,800,036)
test calc_primes_mine_naive_with_mut ... bench: 18,212,183 ns/iter (+/- 2,936,353)
Upvotes: 0
Reputation: 8095
I found two more promising optimizations:
The classical algorithm:
pub fn primes_erathosthenes(upper_bound: u64) -> Vec<u64> {
let sieve_size = (upper_bound + 1) as u64;
let mut sieve = vec![false; sieve_size as usize];
let mut p = 2;
'next: loop {
if p * p >= sieve_size {
break 'next;
}
for f in p..=(upper_bound / p) {
let index = p * f;
sieve[index as usize] = true;
}
p += 1;
while sieve[p as usize] {
p += 1;
}
}
sieve
.into_iter()
.enumerate()
.skip(2)
.filter_map(|(i, p)| if p { None } else { Some(i as u64) })
.collect()
}
Your final algorithm with an improved loop:
fn base_improved(head: Vec<u64>, p: u64) -> Vec<u64> {
let mut fvec = vec![false; p as usize];
fvec[0] = true;
fvec[1] = true;
head
.iter()
.flat_map(|&x| (2u64..p).map(move |i| i * x).take_while(|y| *y < p))
.for_each(|y| fvec[y as usize] = true);
fvec
.iter()
.enumerate()
.filter_map(|(i, x)| if !*x { Some(i as u64) } else { None })
.collect()
}
and a faster sieving algorithm (Sundaram's sieve; I can post the code if it is of interest).
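Since the answer only mentions it, here is a minimal sketch of Sundaram's sieve (my own, not the code benchmarked below). The idea: an odd number 2k+1 is composite exactly when k can be written as i + j + 2ij with 1 <= i <= j, so we mark those k and keep the rest.

```rust
// Sieve of Sundaram: mark every k = i + j + 2*i*j; unmarked k give odd primes 2*k + 1.
pub fn primes_sundaram(n: usize) -> Vec<usize> {
    let k = n.saturating_sub(2) / 2; // odd candidates 2*m + 1 < n for m in 1..=k
    let mut marked = vec![false; k + 1];
    let mut i = 1;
    while i + i + 2 * i * i <= k {
        let mut j = i;
        while i + j + 2 * i * j <= k {
            marked[i + j + 2 * i * j] = true;
            j += 1;
        }
        i += 1;
    }
    let mut primes = vec![2]; // mirror the question's convention of always including 2
    primes.extend((1..=k).filter(|&m| !marked[m]).map(|m| 2 * m + 1));
    primes
}

fn main() {
    println!("{:?}", primes_sundaram(30)); // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
}
```

It touches only half the range (odd numbers), which is where its speed advantage over a plain Eratosthenes comes from.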
Benchmarks:
test calc_erathosthenes ... bench: 34,192 ns/iter (+/- 2,902)
test calc_exercism ... bench: 79,314 ns/iter (+/- 16,276)
test calc_stackoverflow ... bench: 35,431 ns/iter (+/- 5,674)
test calc_stackoverflow_improved ... bench: 32,508 ns/iter (+/- 6,971)
test calc_sundaram ... bench: 17,930 ns/iter (+/- 3,182)
test calc_veedrac ... bench: 27,110 ns/iter (+/- 3,870)
test calc_veedrac_6 ... bench: 12,178 ns/iter (+/- 4,113)
Upvotes: 2
Reputation: 60227
Your code has some very strange overheads, like (0..x).map(|y| y > 0).collect::<Vec<bool>>(), which you then only use as x[(n % (x.len() as u64)) as usize]... aka as a really slow way of writing n % x == 0.
Here's an almost-direct variant of your code where I simply remove these overheads you've added, and inline code where it makes sense.
pub fn primes(n: usize) -> Vec<usize> {
if n < 4 {
vec![2]
} else {
let mut r = primes((n as f64).sqrt().ceil() as usize);
let mut fvec: Vec<_> = (0..n).map(|i| r.iter().any(|x| i % x == 0)).collect();
fvec[1] = true;
for (i, x) in fvec.into_iter().enumerate() {
if !x { r.push(i) }
}
r
}
}
The largest actual change was replacing your skip(1) with fvec[1] = true, since it was significantly simpler that way.
Of course, the actual sieve of Eratosthenes is simpler, lower overhead, and has a better time complexity.
pub fn primes(n: usize) -> Vec<usize> {
let mut primes = Vec::new();
let mut sieve = vec![true; n];
for p in 2..n {
if sieve[p] {
primes.push(p);
let mut i = p * p;
while i < n {
sieve[i] = false;
i += p;
}
}
}
primes
}
Here's a lightly optimized sieve, where unnecessary work and memory use is avoided.
pub fn primes6(mut n: usize) -> Vec<usize> {
let mut primes = vec![2];
n >>= 1;
    // sieve[p] represents the odd number 2*p + 1:
    // p:    0  1  2  3  4
    // odd:  1, 3, 5, 7, 9, ...
let mut sieve = vec![true; n];
let end = (n as f64).sqrt().ceil() as usize;
for p in 1..end {
if sieve[p] {
let prime = p * 2 + 1;
primes.push(prime);
let mut i = ((prime * prime) - 1) / 2;
while i < n {
sieve[i] = false;
i += prime; // skip by 2·prime
}
}
}
for p in end..n {
if sieve[p] {
let prime = p * 2 + 1;
primes.push(prime);
}
}
primes
}
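A cheap way to sanity-check any of these sieve variants (off-by-one errors in the odd-only indexing are easy to make) is to compare against straightforward trial division on a small range. A hypothetical helper, not part of the answer:

```rust
// Naive trial-division primality test, only for validating sieve output on small inputs.
fn is_prime(n: usize) -> bool {
    n >= 2 && (2..).take_while(|d| d * d <= n).all(|d| n % d != 0)
}

fn main() {
    let expected: Vec<usize> = (0..1000).filter(|&n| is_prime(n)).collect();
    // e.g.: assert_eq!(primes6(1000), expected);
    println!("{} primes below 1000", expected.len()); // prints "168 primes below 1000"
}
```

It is far too slow to benchmark, but as a one-off correctness oracle it is hard to get wrong.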
Upvotes: 4