Reputation: 183
There is something that baffles me with integer arithmetic in tutorials. To be precise, integer division.
The seemingly preferred method is by casting the divisor into a float, then rounding the float to the nearest whole number, then cast that back into integer:
#include <cmath>
int round_divide_by_float_casting(int a, int b){
return (int) std::roundf( a / (float) b);
}
Yet this seems like scratching your left ear with your right hand. I use:
int round_divide (int a, int b){
return a / b + a % b * 2 / b;
}
It's no breakthrough, but the fact that it is not standard makes me wonder if I am missing anything?
Despite my (albeit limited) testing, I couldn't find any scenario where the two methods give me different results. Did someone run into some sort of scenario where the int → float → int casting produced more accurate results?
Upvotes: 5
Views: 749
Reputation: 40060
If one defined what your functions should return, she would describe it as something close as "f(a, b)
returns the closest integer of the result of the division of a
by b
in the real divisor ring."
Thus, the question can be summarized as: can we define this closest integer using only integer division. I think we can.
There is exactly two candidates as the closest integer: a / b
and (a / b) + 1
(1). The selection is easy, if a % b
is closer to 0
as it is to b
, then a / b
is our result. If not, (a / b) + 1
is.
One could then write something similar to, ignoring optimization and good practices:
int divide(int a, int b)
{
const int quot = a / b;
const int rem = a % b;
int result;
if (rem < b - rem) {
result = quot;
} else {
result = quot + 1;
}
return result;
}
While this definition satisfies out needs, one could optimize it by not computing two times the division of a
by b
with the use of std::div()
:
int divide(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem) {
++result;
}
return result;
}
The analysis of the problem we did earlier assures us of the well defined behaviour of our implementation.
(1)There is just one last thing to check: how does it behaves when a
or b
is negative? This is left to the reader ;).
#include <iostream>
#include <iomanip>
#include <string>
// solutions
#include <cmath>
#include <cstdlib>
// benchmak
#include <limits>
#include <random>
#include <chrono>
#include <algorithm>
#include <functional>
//
// Solutions
//
namespace
{
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem >= b - dv.rem)
{
++result;
}
return result;
}
}
//
// benchmark
//
class Randomizer
{
std::mt19937 _rng_engine;
std::uniform_int_distribution<int> _distri;
public:
Randomizer() : _rng_engine(std::time(0)), _distri(std::numeric_limits<int>::min(), std::numeric_limits<int>::max())
{
}
template<class ForwardIt>
void operator()(ForwardIt begin, ForwardIt end)
{
std::generate(begin, end, std::bind(_distri, _rng_engine));
}
};
class Clock
{
std::chrono::time_point<std::chrono::steady_clock> _start;
public:
static inline std::chrono::time_point<std::chrono::steady_clock> now() { return std::chrono::steady_clock::now(); }
Clock() : _start(now())
{
}
template<class DurationUnit>
std::size_t end()
{
return std::chrono::duration_cast<DurationUnit>(now() - _start).count();
}
};
//
// Entry point
//
int main()
{
Randomizer randomizer;
std::array<int, 1000> dividends; // SCALE THIS UP (1'000'000 would be great)
std::array<int, dividends.size()> divisors;
std::array<int, dividends.size()> results;
randomizer(std::begin(dividends), std::end(dividends));
randomizer(std::begin(divisors), std::end(divisors));
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_float_casting(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_float_casting(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = round_divide_by_modulo(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "round_divide_by_modulo(): " << std::setprecision(3) << unit_time << " ns\n";
}
{
Clock clock;
auto dividend = std::begin(dividends);
auto divisor = std::begin(divisors);
auto result = std::begin(results);
for ( ; dividend != std::end(dividends) ; ++dividend, ++divisor, ++result)
{
*result = divide_by_quotient_comparison(*dividend, *divisor);
}
const float unit_time = clock.end<std::chrono::nanoseconds>() / static_cast<float>(results.size());
std::cout << std::setw(40) << "divide_by_quotient_comparison(): " << std::setprecision(3) << unit_time << " ns\n";
}
}
Outputs:
g++ -std=c++11 -O2 -Wall -Wextra -Werror main.cpp && ./a.out
round_divide_by_float_casting(): 54.7 ns
round_divide_by_modulo(): 24 ns
divide_by_quotient_comparison(): 25.5 ns
The two arithmetic solutions' performances are not distinguishable (their benchmark converges when you scale the bench size up).
Upvotes: 5
Reputation: 183
Thanks for the suggestions so far. To shed some light I made a test setup to compare performance.
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem <= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 1000;
//while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
divide_by_quotient_comparison(i, j);
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_float_casting(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 10; j < itr + 1; j++) {
round_divide_by_modulo(i, j);
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
//}
return 0;
}
The results I got on my machine (i7 with Visual Studio 2015) was as follows: the modulo arithmetic was about twice as fast as the int → float → int casting method. The method relying on std::div_t (suggested by @YSC and @teroi) was faster than the int → float → int, but slower than the modulo arithmetic method.
A second test was performed to avoid certain compiler optimizations pointed out by @YSC:
#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <chrono>
#include <vector>
using namespace std;
int round_divide_by_float_casting(int a, int b) {
return (int)roundf(a / (float)b);
}
int round_divide_by_modulo(int a, int b) {
return a / b + a % b * 2 / b;
}
int divide_by_quotient_comparison(int a, int b)
{
const std::div_t dv = std::div(a, b);
int result = dv.quot;
if (dv.rem <= b - dv.rem) {
++result;
}
return result;
}
int main()
{
int itr = 100;
vector <int> randi, randj;
for (int i = 0; i < itr; i++) {
randi.push_back(rand());
int rj = rand();
if (rj == 0)
rj++;
randj.push_back(rj);
}
vector<int> f, m, q;
while (true) {
auto begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
q.push_back( divide_by_quotient_comparison(randi[i] , randj[j]) );
}
}
auto end = std::chrono::steady_clock::now();
cout << "divide_by_quotient_comparison(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
f.push_back( round_divide_by_float_casting(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_float_casting(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
begin = chrono::steady_clock::now();
for (int i = 0; i < itr; i++) {
for (int j = 0; j < itr; j++) {
m.push_back( round_divide_by_modulo(randi[i], randj[j]) );
}
}
end = std::chrono::steady_clock::now();
cout << "round_divide_by_modulo(,) function took: "
<< chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count()
<< endl;
cout << endl;
f.clear();
m.clear();
q.clear();
}
return 0;
}
In this second test the slowest was the divide_by_quotient()
reliant on std::div_t, followed by divide_by_float()
, and the fastest again was the divide_by_modulo()
. However this time the performance difference was much, much lower, less than 20%.
Upvotes: 0
Reputation: 1087
Prefer the standard solution. Use std::div family of functions declared in cstdlib.
See: http://en.cppreference.com/w/cpp/numeric/math/div
Casting to float and then to int may be very inefficient on some architectures, for example, microcontrollers.
Upvotes: 1
Reputation: 129324
It would really depend on the processor, and the range of the integer which is better (and using double
would resolve most of the range issues)
For modern "big" CPUs like x86-64 and ARM, integer division and floating point division are roughly the same effort, and converting an integer to a float or vice versa is not a "hard" task (and does the correct rounding directly in that conversion, at least), so most likely the resulting operations are.
atmp = (float) a;
btmp = (float) b;
resfloat = divide atmp/btmp;
return = to_int_with_rounding(resfloat)
About four machine instructions.
On the other hand, your code uses two divides, one modulo and a multiply, which is quite likely longer on such a processor.
tmp = a/b;
tmp1 = a % b;
tmp2 = tmp1 * 2;
tmp3 = tmp2 / b;
tmp4 = tmp + tmp3;
So five instructions, and three of those are "divide" (unless the compiler is clever enough to reuse a / b
for a % b
- but it's still two distinct divides).
Of course, if you are outside the range of number of digits that a float or double can hold without losing digits (23 bits for float, 53 bits for double), then your method MAY be better (assuming there is no overflow in the integer math).
On top of all that, since the first form is used by "everyone", it's the one that the compiler recognises and can optimise.
Obviously, the results depend on both the compiler being used and the processor it runs on, but these are my results from running the code posted above, compiled through clang++
(v3.9-pre-release, pretty close to released 3.8).
round_divide_by_float_casting(): 32.5 ns
round_divide_by_modulo(): 113 ns
divide_by_quotient_comparison(): 80.4 ns
However, the interesting thing I find when I look at the generated code:
xorps %xmm0, %xmm0
cvtsi2ssl 8016(%rsp,%rbp), %xmm0
xorps %xmm1, %xmm1
cvtsi2ssl 4016(%rsp,%rbp), %xmm1
divss %xmm1, %xmm0
callq roundf
cvttss2si %xmm0, %eax
movl %eax, 16(%rsp,%rbp)
addq $4, %rbp
cmpq $4000, %rbp # imm = 0xFA0
jne .LBB0_7
is that the round
is actually a call. Which really surprises me, but explains why on some machines (particularly more recent x86 processors), it is faster.
g++
gives better results with -ffast-math
, which gives around:
round_divide_by_float_casting(): 17.6 ns
round_divide_by_modulo(): 43.1 ns
divide_by_quotient_comparison(): 18.5 ns
(This is with increased count to 100k values)
Upvotes: 3