Reputation: 38971
According to this answer, the poster expects a std::bitset of 100k bits to be faster than a std::vector<bool> when querying individual bits. How can this be possible? How could the two even differ significantly in their implementation, if std::bitset apparently allows for arbitrary sizes just like std::vector?
Upvotes: 40
Views: 41930
Reputation: 1706
Honestly, I think bitset is better suited to the stack than to the heap. Moreover, the two do not conflict with one another, because an elegant solution can combine them:
vector< bitset<64> > v(100000); // or whatever...
It could also be interesting to compare these two instead:
vector<unsigned char> v1(1000000); // 8 bits per element, managed manually
vector< bitset<8> > v2(1000000);   // 8 bits per element, managed by bitset
Moreover, just to add to the answers here and as a reminder of how much the compiler matters for performance, here is a simple test. (All these tests are a bit tricky and at best give a rough general idea for a direct comparison; in the end, profiling the actual project is the only thing to do.)
NOTE: with size 10^7 the stack-allocated bitset in test2 crashes at runtime under VS2012, while g++ is OK (see the commented-out SIZE2 below). The timings include the constructor/destructor overhead of the objects too. Here is the simple test code:
#include <iostream>
#include <vector>
#include <bitset>
#include <time.h>
using namespace std;
#define SIZE1 1000000000 //1e9
//#define SIZE2 10000000 //1e7: VS2012 crashes at runtime, g++ OK
#define SIZE2 1000000 //1e6
void test1()
{
    bool j;
    clock_t t1, t2;
    cout.precision(10);

    t1 = clock();
    vector<bool> *v = new vector<bool>(SIZE1);
    for (long int i = 0; i < SIZE1; i++)
        (*v)[i] = (i % 2 == 0);
    for (long int i = 0; i < SIZE1; i++)
        j = (*v)[i];
    delete v;
    t2 = clock();
    cout << "vector speed = " << (t2 - t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;

    t1 = clock();
    bitset<SIZE1> *b = new bitset<SIZE1>(); // too large for the stack
    for (long int i = 0; i < SIZE1; i++)
        (*b)[i] = (i % 2 == 0);
    for (long int i = 0; i < SIZE1; i++)
        j = (*b)[i];
    delete b;
    t2 = clock();
    cout << "bitset speed = " << (t2 - t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;
    (void) j; // silence unused-variable warning
}
void test2()
{
    bool j;
    clock_t t1, t2;
    cout.precision(10);

    t1 = clock();
    vector<bool> v(SIZE2);
    for (int k = 0; k < SIZE1 / SIZE2; k++)
        for (long int i = 0; i < SIZE2; i++)
            v[i] = (i % 2 == 0);
    for (int k = 0; k < SIZE1 / SIZE2; k++)
        for (long int i = 0; i < SIZE2; i++)
            j = v[i];
    t2 = clock();
    cout << "vector speed = " << (t2 - t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;
    cout << "v[1], v[2] " << (int) v[1] << ", " << (int) v[2] << endl;

    t1 = clock();
    bitset<SIZE2> b; // small enough for the stack
    for (int k = 0; k < SIZE1 / SIZE2; k++)
        for (long int i = 0; i < SIZE2; i++)
            b[i] = (i % 2 == 0);
    for (int k = 0; k < SIZE1 / SIZE2; k++)
        for (long int i = 0; i < SIZE2; i++)
            j = b[i];
    t2 = clock();
    cout << "bitset speed = " << (t2 - t1) / (float) CLOCKS_PER_SEC << " (" << t2 << "," << t1 << ")" << endl;
    cout << "b[1], b[2] " << (int) b[1] << ", " << (int) b[2] << endl;
    (void) j;
}
int main(int argc, char* argv[])
{
    test1();
    test2();
    return 0;
}
VS2012 output:
vector speed = 3.105000019 (3105,0)
bitset speed = 10.44400024 (13551,3107)
vector speed = 3.987999916 (17542,13554)
v[1], v[2] 0, 1
bitset speed = 9.772999763 (27318,17545)
b[1], b[2] 0, 1
mingw/g++ output -O2:
vector speed = 1.519 (1520,1)
bitset speed = 1.647 (3168,1521)
vector speed = 1.383999944 (4554,3170)
v[1], v[2] 0, 1
bitset speed = 1.610000014 (6166,4556)
b[1], b[2] 0, 1
mingw/g++ output -O2 -std=c++11:
vector speed = 1.528 (1529,1)
bitset speed = 1.685 (3215,1530)
vector speed = 1.409999967 (4626,3216)
v[1], v[2] 0, 1
bitset speed = 1.763000011 (6392,4629)
b[1], b[2] 0, 1
g++ 4.8.2 output -O2:
vector speed = 1.561391 (1564139,2748)
bitset speed = 1.681818 (3246051,1564233)
vector speed = 1.487877011 (4733975,3246098)
v[1], v[2] 0, 1
bitset speed = 1.685297012 (6419328,4734031)
b[1], b[2] 0, 1
g++ 4.8.2 output -O2 -std=c++11:
vector speed = 1.561391 (1564139,2748)
bitset speed = 1.681818 (3246051,1564233)
vector speed = 1.487877011 (4733975,3246098)
v[1], v[2] 0, 1
bitset speed = 1.685297012 (6419328,4734031)
b[1], b[2] 0, 1
CONCLUSION:
As a rough idea, vector seems faster for these use cases. I didn't run multiple instances and average the results, but the values are more or less the same every time.
Note on VS: I think it uses a different memory management mechanism than gcc, and for these use cases the generated code seems slower.
Upvotes: 3
Reputation: 1484
Here's my unscientific benchmark of performing 30 million (3E7) accesses/inserts on a bitset<> and a vector<bool> of sizes 100K, 1M and 5M. The compiler is GCC 4.8.2 on a 64-bit Linux machine (Core i7).
With optimization (compiler flags: -O2 -std=c++11):
[estan@pyret bitset_vs_vector]$ ./bitset_vs_vector
bitset<100000> (3E7 accesses/inserts): 132.424 ms
vector<bool>(100000) (3E7 accesses/inserts): 270.577 ms
bitset<1000000> (3E7 accesses/inserts): 67.752 ms
vector<bool>(1000000) (3E7 accesses/inserts): 268.193 ms
bitset<5000000> (3E7 accesses/inserts): 67.426 ms
vector<bool>(5000000) (3E7 accesses/inserts): 267.566 ms
Without optimization (compiler flags: -std=c++11):
[estan@pyret bitset_vs_vector]$ make
g++ -std=c++11 -o bitset_vs_vector *.cpp
[estan@pyret bitset_vs_vector]$ ./bitset_vs_vector
bitset<100000> (3E7 accesses/inserts): 1900.13 ms
vector<bool>(100000) (3E7 accesses/inserts): 1784.76 ms
bitset<1000000> (3E7 accesses/inserts): 1825.09 ms
vector<bool>(1000000) (3E7 accesses/inserts): 1768.03 ms
bitset<5000000> (3E7 accesses/inserts): 1846.73 ms
vector<bool>(5000000) (3E7 accesses/inserts): 1763.48 ms
So it seems that, under these conditions, bitset is faster than vector<bool> when the code is optimized, while vector<bool> comes out on top by a (very) small margin when it's not.
That said, if your code is time-critical you should perform benchmarks yourself, since I suspect these numbers are highly compiler/environment specific.
Benchmark code:
#include <iostream>
#include <functional>
#include <bitset>
#include <vector>
#include <ctime>
// Performs N accesses/inserts on the container.
template<class T>
void access_and_insert(T &container, int N)
{
    const std::size_t size = container.size();
    for (int i = 0; i < N; ++i) {
        bool v = container[i % size];
        (void) v; // the read itself is what we want to measure
        container[i % size] = true;
    }
}

// Measure the time in milliseconds required to call f.
double measure(std::function<void (void)> f)
{
    clock_t start = std::clock();
    f();
    return 1000.0 * (std::clock() - start) / CLOCKS_PER_SEC;
}
int main(void)
{
    // Benchmark with 100K elements.
    std::bitset<100000> bitset100K;
    std::vector<bool> vector100K(100000);
    std::cout << "bitset<100000> (3E7 accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(bitset100K, 3E7); }) << " ms" << std::endl;
    std::cout << "vector<bool>(100000) (3E7 accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(vector100K, 3E7); }) << " ms" << std::endl;
    std::cout << std::endl;

    // Benchmark with 1M elements.
    std::bitset<1000000> bitset1M;
    std::vector<bool> vector1M(1000000);
    std::cout << "bitset<1000000> (3E7 accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(bitset1M, 3E7); }) << " ms" << std::endl;
    std::cout << "vector<bool>(1000000) (3E7 accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(vector1M, 3E7); }) << " ms" << std::endl;
    std::cout << std::endl;

    // Benchmark with 5M elements.
    std::bitset<5000000> bitset5M;
    std::vector<bool> vector5M(5000000);
    std::cout << "bitset<5000000> (3E7 accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(bitset5M, 3E7); }) << " ms" << std::endl;
    std::cout << "vector<bool>(5000000) (3E7 accesses/inserts): ";
    std::cout << measure([&]() { access_and_insert(vector5M, 3E7); }) << " ms" << std::endl;
    return 0;
}
Upvotes: 2
Reputation: 38971
Measurements on Visual Studio 2010 show that std::bitset is not generally faster than std::vector<bool>. I cannot say what the exact reason is -- only that bitset is implemented significantly differently from the std::vector<bool> full specialization.
std::bitset stores its full contents inside the object, via an array:
template<size_t _Bits>
class bitset .....
_Ty _Array[_Words + 1]; // the set of bits
};
and that makes large bitsets unsuitable for the stack -- which isn't a performance argument per se.
vector<bool> doesn't suffer from the stack problem, and testing with sizes of 1e6 and 1e7, it seems that on my box querying values in a loop is actually 2x faster with a vector.
Well, I guess the usual timing caveats apply and YMMV, but here's the test code I used, should anyone care to try it themselves:
The output on my box is:
1
vector<bool> loop with a size of 10000000 and 10 iterations*n: 11187 ms
bitset<10000000> loop with 10 iterations*n: 22719 ms
101250010
Press any key to continue . . .
BitMap.cpp
#include "stdafx.h"
#include "BitMap.h"
using namespace std;
// Global var to prevent optimizer from messing things up
volatile size_t ext;
volatile clock_t t1;
volatile clock_t t2;
double delta1;
double delta2;
int main(int argc, _TCHAR* argv[])
{
    ext = 1;
    printf("%u\n", (unsigned) ext);

    vb_t *const vec = new vb_t(bssz);
    bs_t *const bits = new bs_t(); // must put large bitset on heap

    const int iter = 10;
    delta1 = 0;
    delta2 = 0;
    for (int o = 0; o < 5; ++o) {
        t1 = clock();
        for (int i = 0; i != 5; ++i)
            bs_loop(iter, *vec);
        t2 = clock();
        delta1 += t2 - t1;

        t1 = clock();
        for (int i = 0; i != 5; ++i)
            bs_loop(iter, *bits);
        t2 = clock();
        delta2 += t2 - t1;
    }
    delta1 /= CLOCKS_PER_SEC;
    delta2 /= CLOCKS_PER_SEC;
    delta1 *= 1000;
    delta2 *= 1000;

    cout << "vector<bool> loop with a size of " << bssz << " and " << iter << " iterations*n: " << delta1 << " ms\n";
    cout << "bitset<" << bssz << "> loop with " << iter << " iterations*n: " << delta2 << " ms\n";
    printf("%u\n", (unsigned) ext);

    delete vec;
    delete bits;
    return 0;
}
BitMap.h
#pragma once
#include <vector>
#include <bitset>
extern volatile size_t ext;
const size_t bssz = size_t(1e7); // 1e7 ca 10m
using namespace std; // Test code, using here is OK.
typedef vector<bool> vb_t;
typedef bitset<bssz> bs_t;
template<class COLL>
void bs_loop(const int iterations, COLL const& v);
bs_loop.cpp
#include "stdafx.h"
#include "BitMap.h"
template<class COLL>
void bs_loop(const int iterations, COLL const& v)
{
    ext = sizeof(COLL);
    for (int i = 0; i != iterations; ++i) {
        ++ext;
        for (size_t j = 0, e = v.size(); j != e; ++j) {
            if (v[j]) {
                --ext;
            }
            else {
                ++ext;
            }
        }
    }
}
template
void bs_loop(const int iterations, vb_t const& v);
template
void bs_loop(const int iterations, bs_t const& v);
Compiler command line:
/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /D "WIN32" /D "NDEBUG"
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy
/fp:precise /Zc:wchar_t /Zc:forScope /Yu"StdAfx.h" /Fp"Release\BitMap.pch"
/Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze-
/errorReport:queue
Note the /O2 and the missing /GL (no whole-program optimization).
Upvotes: 17
Reputation: 8526
Well, since I'm the guy you're basing this question on, here's where I got that idea from:
"…it packs the bools and stores them as individual bits (inside, say, chars) in its internal representation. One consequence of this is that it can't just return a normal bool& from its operator[] or its dereferenced iterators[2]; instead, it has to play games with a helper "proxy" class that is bool-like but is definitely not a bool. Unfortunately, that also means that access into a vector<bool> is slower, because we have to deal with proxies instead of direct pointers and references.
…
Bottom line: If you care more about speed than you do about size, you shouldn't use std::vector<bool>. Instead, you should hack around this optimization by using a std::vector<char> or the like instead, which is unfortunate but still the best you can do."
Or, as I recommended, if you know the biggest size that your set will get, use std::bitset.
Upvotes: 7
Reputation: 12620
Also, note that a vector<bool> is a specialization of the vector template, and is implemented quite differently than you might think.
Upvotes: -2
Reputation: 133122
The vector accesses its elements through iterators, which can't be a simple typedef for bool*; that makes it slower than bitset, which doesn't provide iterators. Another thing that makes bitset fast is that its size is known at compile time, so it does no allocation with new, which is slower than stack allocation. Just random thoughts.
Upvotes: 1