Reputation: 307
I am searching for a very CPU-efficient way to compute floating-point modulus one (including negative values) in C. I am using it for normalized phase reduction (wrapping), i.e. 7.6 -> 0.6, 0.2 -> 0.2, -1.1 -> 0.9, and so on.
From what I understand, fmod() and also floor() are usually very inefficient. I don't need the function to be strict, i.e. to take NaN or Inf into account, since I am responsible for passing valid values.
I have always been using
m = x - (float)(int)x;   // truncate toward zero
m += (float)(m < 0.f);   // branchless: add one if m is negative
which, according to my benchmarks, is in general much more efficient than fmod() or than using floor() in place of the int cast. But I was wondering whether an even more efficient way exists, perhaps based on bit manipulation...
I am coding on 64-bit Intel CPUs with GCC, but for my purposes I am using 32-bit single-precision floats.
I apologize if this has been addressed somewhere else, but my search could not find anything on this specific topic.
EDIT: Sorry, I realized there was a subtle error in the originally posted code, so I had to fix it: 1 must be added if the result (m) is negative, not if x was negative.
EDIT 2: Actually, after benchmarking the same function using x - floor(x) instead of x - (float)(int)x on GCC 12 with all math optimizations on, I must say the former is faster, since GCC is evidently smart enough to replace the floor() call with very efficient code (this at least is true on my Intel i7). This, however, may not always be the case with every CPU and compiler; in other cases both floor() and fmod() are, in my experience, very inefficient. Therefore my quest for a bit-manipulation or comparable trick that may be much faster on every compiler and architecture still applies.
Upvotes: 7
Views: 553
Reputation: 12847
A prototype in C++ (I am not up to date with C). The padding logic is still not optimized, but if you have AVX-512 on your system you can do something like this to process 8 doubles, or 16 floats, per loop iteration. I found a lot of useful information here: intrinsics cheat sheet
I used the MSVC compiler from Visual Studio 2022.
#include <type_traits>
#include <vector>
#include <immintrin.h>
void reduce_phases(std::vector<double>& inputs)
{
    // A 512-bit register holds 8 doubles (512 / 64 bits), not 512 / sizeof(double).
    static constexpr std::size_t vector_size = sizeof(__m512d) / sizeof(double);
    auto number_to_pad = vector_size - (inputs.size() % vector_size);
    inputs.insert(inputs.end(), number_to_pad, 0.0);

    auto data_ptr = inputs.data();
    for (std::size_t n{ 0ul }; n < inputs.size(); n += vector_size, data_ptr += vector_size)
    {
        // Unaligned load/store: std::vector does not guarantee 64-byte alignment.
        auto values = _mm512_loadu_pd(data_ptr);
        auto floors = _mm512_floor_pd(values);
        auto result = _mm512_sub_pd(values, floors);
        _mm512_storeu_pd(data_ptr, result);
    }
    inputs.erase(inputs.end() - number_to_pad, inputs.end());
}
void reduce_phases(std::vector<float>& inputs)
{
    // A 512-bit register holds 16 floats (512 / 32 bits), not 512 / sizeof(float).
    static constexpr std::size_t vector_size = sizeof(__m512) / sizeof(float);
    auto number_to_pad = vector_size - (inputs.size() % vector_size);
    inputs.insert(inputs.end(), number_to_pad, 0.0f);

    auto data_ptr = inputs.data();
    for (std::size_t n{ 0ul }; n < inputs.size(); n += vector_size, data_ptr += vector_size)
    {
        // Unaligned load/store: std::vector does not guarantee 64-byte alignment.
        auto values = _mm512_loadu_ps(data_ptr);
        auto floors = _mm512_floor_ps(values);
        auto result = _mm512_sub_ps(values, floors);
        _mm512_storeu_ps(data_ptr, result);
    }
    inputs.erase(inputs.end() - number_to_pad, inputs.end());
}
int main()
{
    std::vector<double> values{ -1.1, -1.9, -1.5, -0.4, 0.0, 0.4, 1.5, 1.9, 2.1 };
    reduce_phases(values);

    std::vector<float> float_values{ -1.1f, -1.9f, -1.5f, -0.4f, 0.0f, 0.4f, 1.5f, 1.9f, 2.1f };
    reduce_phases(float_values);

    return 0;
}
Upvotes: 3