Reputation: 1
I've been following the tutorial series "Ray Tracing in One Weekend", which seems to be a fairly canonical resource for learning ray tracing.
I've been trying to speed up the code using OpenMP, but the results have been rather disappointing, and from this GitHub discussion I believe others have achieved better speedups.
Here's what some of the users there report:
It's also worth noting that the changes are very local: Write into buffer instead of std::cout line; new method to write buffer to file; single OpenMP line above for loop over rows of the output image.
I was able to achieve significant speedup, the need for which became painfully apparent in the final scene of the second book, by using #pragma omp parallel for before the multi sample loop
With OpenMP, you can achieve this with two lines of OpenMP annotations and not a single change to the code itself
Using these strategies, I was unable to get any significant speedup.
I made a copy of the latest version of the Github repo (v4.0.1) and worked on the "In One Weekend" section.
I added the annotation #pragma omp parallel for reduction(+:pixel_color) around the sample loop (the one with header for (int sample = 0; sample < samples_per_pixel; sample++)), and added #pragma omp declare reduction(+ : vec3 : omp_out += omp_in) initializer(omp_priv(0,0,0)) to define the reduction over vec3.
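Concretely, with those two lines added, the relevant part of the render code looks roughly like this (a sketch; the loop body is paraphrased from the book's camera::render, so the exact get_ray / ray_color signatures may differ slightly):

#pragma omp declare reduction(+ : vec3 : omp_out += omp_in) initializer(omp_priv(0,0,0))

// inside the serial loops over j (rows) and i (columns):
        color pixel_color(0,0,0);
        // Only the per-pixel sample loop is parallelized; the pixel loops stay serial.
        #pragma omp parallel for reduction(+ : pixel_color)
        for (int sample = 0; sample < samples_per_pixel; sample++) {
            ray r = get_ray(i, j);
            pixel_color += ray_color(r, max_depth, world);
        }
        write_color(std::cout, pixel_samples_scale * pixel_color);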
I timed only how long it took to complete cam.render(world), using std::chrono::steady_clock. Rendering the default scene, this gave me a speedup of only 1.74x. However, this feels suspiciously low considering I'm using 8 cores (and I verified that number with omp_get_num_procs()).
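The timing is just a std::chrono wrapper around the render call, roughly like this:

#include <chrono>
#include <iostream>

// In main(): measure only the render, not scene setup or file output.
auto start = std::chrono::steady_clock::now();
cam.render(world);
auto stop = std::chrono::steady_clock::now();
std::cerr << "cam.render(world) took "
          << std::chrono::duration<double>(stop - start).count() << " s\n";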
I reverted to commit 2e5cc2e (for no other reason than the fact that it was released on Dec 9, 2020, the same day another GitHub user made a post about achieving significant speedup with OpenMP). I modified the code to write to the 2D vector image instead of outputting to stdout, and then afterwards write image to a file. I timed how long it took to populate the colors in image. The modified code in scene.h looks like this:
std::vector<std::vector<color>> image(image_height, std::vector<color>(image_width));

omp_set_num_threads(8);
#pragma omp parallel for
for (int j = image_height-1; j >= 0; --j) {
    for (int i = 0; i < image_width; ++i) {
        color pixel_color(0,0,0);
        for (int s = 0; s < samples_per_pixel; ++s) {
            auto u = (i + random_double()) / (image_width-1);
            auto v = (j + random_double()) / (image_height-1);
            ray r = cam.get_ray(u, v);
            pixel_color += ray_color(r, max_depth);
        }
        image[j][i] = pixel_color * pixel_samples_scale;
    }
}
This gave me a speedup of only 2.34x, but again, I'm using 8 cores and would expect something higher.
I have been compiling with these C++ flags: -O3 -Wall -std=c++17 -m64 -I. -fopenmp. All header files are protected with #ifndef include guards, so there is no need for #pragma once at the top. I've also experimented with schedule(dynamic), i.e. #pragma omp parallel for schedule(dynamic), which seems reasonable for ray tracing since the per-row cost varies, but that only made the speedup lower.
More information surrounding this question can again be seen at this GitHub discussion, which was recently created. I believe that a better speedup should be easy to achieve; I'm just not sure why my pragmas do not provide it.
Thanks for any input and let me know if I can provide more details.
Upvotes: 0
Views: 84
Reputation: 490538
One possibility would be the use of two calls to random_double in each iteration of the inner loop of your code.
The books provide two separate implementations of random_double:
inline double random_double() {
    static std::uniform_real_distribution<double> distribution(0.0, 1.0);
    static std::mt19937 generator;
    return distribution(generator);
}
and:
inline double random_double() {
    // Returns a random real in [0,1).
    return std::rand() / (RAND_MAX + 1.0);
}
If you're using the version that's a wrapper around std::rand(), problems with scaling are quite understandable. The source of the problem is fairly simple: std::rand normally keeps a seed to maintain state between one call and the next, and during each call that state is updated. There are a couple of different ways to do this while preventing the seed from being corrupted. One is to use a mutex, so calls to std::rand are (mostly) serialized. Another is to (behind the scenes) create a thread-local seed value, so each thread gets its own seed to play with, and each can update the seed without affecting other threads. That introduces some difficulties of its own, but it does scale much better as you add more threads.
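To make the first alternative concrete: a C library that guards rand's shared state with a lock behaves, in effect, something like this sketch (purely illustrative, not any particular implementation, and locked_random_double is a made-up name):

#include <cstdlib>
#include <mutex>

inline double locked_random_double() {
    // Every thread contends for the same lock, so calls serialize and
    // parallel speedup largely disappears.
    static std::mutex rand_mutex;
    std::lock_guard<std::mutex> lock(rand_mutex);
    return std::rand() / (RAND_MAX + 1.0);
}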
std::mt19937 is a C++ object though. That object contains the seed for the generator, and each object of that type has its own seed. So although you may need to do a tiny bit of extra work to assure that each thread has its own random number generator object, when/if you do so, it pretty much assures that you won't have state shared between the threads to limit scaling as you execute with more threads.
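For instance, that tiny bit of extra work can be as small as making the statics thread_local and seeding each thread's generator separately; a sketch (not the books' code):

#include <random>

inline double random_double() {
    // One generator (and distribution) per thread, seeded per thread so the
    // threads don't all produce the same sequence.
    thread_local std::mt19937 generator(std::random_device{}());
    thread_local std::uniform_real_distribution<double> distribution(0.0, 1.0);
    return distribution(generator);
}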
Upvotes: 1