Nick Zavaritsky

Reputation: 1489

Are C++ STL containers slow when used as simple buffers? (I did a benchmark)

I've always assumed that C++ code is generally on par with C code in terms of efficiency, if not better (e.g. std::sort beats qsort due to comparator inlining). It is also widely accepted that if a dynamic memory buffer is necessary, std::vector would be a reasonable first choice.

Lately I implemented a benchmark to get some hard data and got surprising results: appending to a vector (or a string) in a tight loop is relatively slow.

Please enlighten me if it is possible to make it faster!

See below for the bloody details on the benchmark.

Certainly one option would be to go for custom containers. I desperately want to avoid this route as long as possible for obvious reasons (consistency in the code plus the overhead of converting between incompatible representations of the same thing, say std::string and a custom_string).

About the benchmark.

There is an input buffer and an output buffer. If an input byte passes validation it is transferred to the output as is; otherwise it is encoded and the resulting 3 bytes are transferred (a form of escaping). The algorithm being implemented is UTF-8 validation.

Basically here is the code:

// Templated Sink allows us to play with different methods for building
// the output to estimate the relative efficiency of various approaches
// (ex: a large buffer with no bounds checking vs. std::string).
template <typename Sink>
const unsigned char *
fix_utf8_engine(Sink &sink,
                const unsigned char *i, const unsigned char *end)
{
    while (i < end) {

        // for sinks with limited capacity
        if (!sink.check_capacity())
            return i;

        switch (i[0]) {

            case 0x00 ... 0x7f:
                // 1-byte UTF-8 sequence
                sink.template write<1>(i);
                i += 1;
                continue;

...

I've implemented several variations of fix_utf8: one writing to a preallocated "infinite" buffer with no bounds checking or growing (the baseline), one producing results in a dynamically growing malloc-ed buffer (malloc), one producing a std::string (string), and one producing a std::vector (vector).
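For reference, the baseline variant can be sketched roughly like this (the names are mine, not the actual benchmark code): the capacity check compiles away and each write is a fixed-size copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the baseline sink: the buffer is preallocated
// "infinitely" large, so check_capacity() is a constant and write<N>()
// is a fixed-size copy the compiler can reduce to a few mov instructions.
struct buffer_sink
{
    unsigned char *pos; // next free byte in the preallocated buffer

    explicit buffer_sink(unsigned char *buf): pos(buf) {}

    bool check_capacity() { return true; } // never runs out, by assumption

    template <size_t N> void write(const unsigned char *p) {
        std::memcpy(pos, p, N); // N is a compile-time constant
        pos += N;
    }
};
```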

Here are the results (Ivy Bridge Core i7 laptop, clang-600.0.56 -O3, based on LLVM 3.5svn):

             ASCII       Unicode     Unicode     Unicode     Unicode     Unicode     Random
                         small       full        evil mix    evil short  evil long

baseline     0.01307     0.01816     0.01912     0.01909     0.03104     0.03781     0.06127
malloc       0.01798     0.02068     0.02116     0.02095     0.03918     0.04684     0.06909
string       0.02791     0.03045     0.02575     0.02520     0.07871     0.11513     0.09580
vector       0.06210     0.04925     0.04017     0.04027     0.10103     0.15159     0.12871

Different columns are for different kinds of randomly generated input (Unicode small — limited range, up to 2 bytes per resulting UTF-8 sequence; evil mix — all kinds of broken UTF-8 data interspersed with normal data; evil short/long — all UTF-8 sequences truncated by 1 byte).

These results clearly demonstrate that the string and vector variants are considerably slower in at least two very important use cases: ASCII and Unicode (small).

Any help with improving these results is greatly appreciated!

Upvotes: 1

Views: 1554

Answers (3)

Nick Zavaritsky

Reputation: 1489

In my particular environment the following helped.

The original code working with std::string was slow because many code paths ended with a distinct call to insert(). insert() was partially inlined, resulting in massive code bloat (though not inlining it at all would have been even worse).

The solution that worked was to abandon insert() and render the output through raw pointers. Once the output pointer catches up with the end-of-buffer pointer, we record the output offset within the string, grow the string by a fixed amount, and recompute the pointers.

std::string output;
char *opos;     // output position
char *end_obuf; // end of the output buffer

while (...) {

   if (opos + 6 > end_obuf) { // at most 6 bytes produced per iteration
      size_t offset = opos - &output[0];
      output.resize(output.size() + 128);
      // reinit opos, end_obuf against the (possibly relocated) buffer
      opos = &output[0] + offset;
      end_obuf = &output[0] + output.size();
   }

   // processing
   ...
}

It is important never to write past the string end, since there is no legal way to make that data part of the string (data past the end is erased once the string grows, at least with the default allocator).

Hence we actually resize the string here (intentionally using resize() instead of reserve()).

Growing by a fixed amount is fine: the standard already requires the internal buffer capacity to grow geometrically even when the content grows linearly (that is what makes push_back amortized constant time). The particular amount (128) is a trade-off: growing too much is bad, since we zero-initialize that many bytes on every resize; growing too little is also bad, since we don't want to execute the resize branch too often.

To summarize: many calls to insert() were replaced with a single call to resize(), making the generated code leaner. The resize branch executes infrequently, so its cost is amortized across many loop iterations.

A few closing remarks:

  1. Comparing the STL allocator with realloc was indeed unfair. realloc can often resize a buffer in place without making a copy; the STL allocator interface has no reallocate primitive, so containers always make a complete copy (they must stay correct for arbitrary object types).

    Feature request: it would be nice if STL used realloc for POD types by default.

    "Pessimizing" malloc (forcing the malloc variant to copy on every growth, just as the STL allocator does) made the results virtually identical; this alone accounted for about 20% of the original performance difference.

  2. It was interesting to see whether dynamic reallocation of the string buffer imposes that much overhead. Over-reserving plus pessimizing malloc was not enough to close the gap; in this particular problem the raw data-processing speed mattered more. (Note: the string size is 8 MiB here; with smaller strings the results may change.)

  3. Profile-guided optimization (-fprofile-generate/-fprofile-use) consistently gains another 5-10%; many thanks for this option.
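To illustrate point 1, here is a minimal sketch (the type and names are hypothetical) of a realloc-backed byte buffer: when the allocator can extend the existing block in place, no copy happens at all, which the STL allocator interface cannot express.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical realloc-backed byte buffer. realloc() may extend the block
// in place; an STL allocator must always allocate + copy + deallocate.
struct realloc_buf
{
    unsigned char *data;
    size_t size, cap;

    realloc_buf(): data(0), size(0), cap(0) {}
    ~realloc_buf() { free(data); }

    void append(const unsigned char *p, size_t n) {
        if (size + n > cap) {
            while (cap < size + n)
                cap = cap ? cap + cap / 2 : 64; // geometric growth
            data = static_cast<unsigned char *>(realloc(data, cap));
        }
        memcpy(data + size, p, n);
        size += n;
    }
};
```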

Upvotes: -1

Jason

Reputation: 1086

Yes, you are correct: used like that, vector and string are indeed slow. Every time you append to the end of the container, you potentially ask it to allocate a new buffer, copy the data over to the new buffer, and delete the old one.

You can improve the results in two ways. First, if you know the size, use the reserve() method. Second, if you're using a vector, you can switch to a deque, which is what I use when I don't have a good feel for the final size. It works better in this use case because a deque doesn't reallocate memory; it grows gracefully, page by page. There is an access-time trade-off for that flexibility (one extra level of indirection), but since you're only inserting right now it shouldn't matter here.

One comment claims that you reserve enough, but that's not true: you can write more bytes than you read. I'd still use a growth strategy similar to the malloc variant's inside check_capacity(); that gives you the best apples-to-apples comparison.

Here is a template version that implements check_capacity() and works for the STL containers that expose capacity()/reserve() (std::vector, std::string); std::deque lacks those members, so for deque check_capacity() has to be specialized to simply return true. Use this instead of std_vector_sink and std_string_sink.

// Write output to a container
template < typename CONTAINER >
struct std_sink
{
    CONTAINER &v_;
    std_sink(CONTAINER &v): v_(v) {}
    bool check_capacity() { 
        if (v_.capacity() - v_.size() <= 8) {
            v_.reserve(v_.size() + (v_.size() / 2));
        }
        return true;
    }
    template<size_t n> void write(const unsigned char *p) {
        v_.insert(v_.end(), p, p+n);
    }
    void write_bad(const unsigned char *p) {
        unsigned char esc[] = {
            utf8b_1(*p),
            utf8b_2(*p),
            utf8b_3(*p)
        };
        v_.insert(v_.end(), esc, esc + 3);
    }
};

Also, try it with deque (with check_capacity() specialized as a no-op, as noted above):

std_sink< std::deque< unsigned char> > sink(result);
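As written, check_capacity() will not compile for std::deque, which has neither capacity() nor reserve(). A self-contained sketch of the fix is to explicitly specialize that one member (std_sink abridged here to the relevant parts):

```cpp
#include <cstddef>
#include <deque>

// Abridged copy of std_sink from above, for a self-contained sketch.
template <typename CONTAINER>
struct std_sink
{
    CONTAINER &v_;
    std_sink(CONTAINER &v): v_(v) {}
    bool check_capacity() {
        if (v_.capacity() - v_.size() <= 8)
            v_.reserve(v_.size() + (v_.size() / 2));
        return true;
    }
    template <size_t N> void write(const unsigned char *p) {
        v_.insert(v_.end(), p, p + N);
    }
};

// std::deque has no capacity()/reserve(): specialize check_capacity()
// into a no-op. A deque grows page by page and never relocates existing
// elements, so there is nothing to pre-reserve.
template <>
bool std_sink< std::deque<unsigned char> >::check_capacity() {
    return true;
}
```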

Upvotes: 3

mbgda
mbgda

Reputation: 797

They're fast once you understand how to use them. My guess is that you're repeatedly calling vector.push_back(value), which reallocates a new buffer every time the vector runs out of memory (generally the allocation doubles each time you hit the limit). You want to call vector.reserve(reasonableCapacity) first.
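A minimal sketch of that advice, assuming the question's worst case of 3 output bytes per input byte (the function name and the pass-through body are mine, for illustration):

```cpp
#include <cstddef>
#include <vector>

// Reserve the worst-case output size up front, so that no push_back
// inside the loop can ever trigger a reallocation. The 3x factor is the
// question's worst case: one invalid input byte escapes to 3 output bytes.
std::vector<unsigned char> process(const unsigned char *in, size_t n)
{
    std::vector<unsigned char> out;
    out.reserve(3 * n);           // single allocation, done once
    for (size_t i = 0; i < n; ++i)
        out.push_back(in[i]);     // copy-through case only, for brevity
    return out;
}
```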

Upvotes: 1
