wpzdm

Reputation: 134

Using std::vector as a low-level buffer

The usage here is the same as in "Using read() directly into a C++ std::vector", but with reallocation taken into account.

The size of the input file is unknown, so the buffer is reallocated by doubling its size whenever the file size exceeds the buffer size. Here's my code:

#include <vector>
#include <fstream>
#include <iostream>

int main()
{
    const size_t initSize = 1;
    std::vector<char> buf(initSize); // sizes buf to initSize, so &buf[0] below is valid
    std::ifstream ifile("D:\\Pictures\\input.jpg", std::ios_base::in|std::ios_base::binary);
    if (ifile)
    {
        size_t bufLen = 0;
        for (buf.reserve(1024); !ifile.eof(); buf.reserve(buf.capacity() << 1))
        {
            std::cout << buf.capacity() << std::endl;
            ifile.read(&buf[0] + bufLen, buf.capacity() - bufLen);
            bufLen += ifile.gcount();
        }
        std::ofstream ofile("rebuild.jpg", std::ios_base::out|std::ios_base::binary);
        if (ofile)
        {
            ofile.write(&buf[0], bufLen);
        }
    }
}

The program prints the vector capacity as expected and writes an output file of the same size as the input, BUT only the bytes before offset initSize match the input; everything after that is zero...

Using &buf[bufLen] in read() is definitely undefined behavior, but &buf[0] + bufLen gets the right position to write to because contiguous allocation is guaranteed, isn't it? (Provided initSize != 0. Note that std::vector<char> buf(initSize); sizes buf to initSize. And yes, if initSize == 0, a runtime fatal error occurs in my environment.) Am I missing something? Is this also UB? Does the standard say anything about this usage of std::vector?

Yes, I know we can calculate the file size first and allocate exactly that buffer size, but in my project the input files can be expected to nearly ALWAYS be smaller than a certain SIZE, so I can set initSize to SIZE and expect no overhead (such as file size calculation), and use reallocation just for "exception handling". And yes, I know I can replace reserve() with resize() and capacity() with size() and get things working with little overhead (zeroing the buffer on every resize), but I still want to get rid of any redundant operation; call it a kind of paranoia...

updated 1:

In fact, we can logically deduce from the standard that &buf[0] + bufLen gets the right position. Consider:

std::vector<char> buf(128);
buf.reserve(512);
char* bufPtr0 = &buf[0], *bufPtrOutofRange = &buf[0] + 200;
buf.resize(256); std::cout << "standard guarantees no reallocation" << std::endl;
char* bufPtr1 = &buf[0], *bufInRange = &buf[200]; 
if (bufPtr0 == bufPtr1)
    std::cout << "so bufPtr0 == bufPtr1" << std::endl;
std::cout << "and 200 < buf.size(), standard guarantees bufInRange == bufPtr1 + 200" << std::endl;
if (bufInRange == bufPtrOutofRange)
    std::cout << "finally we have: bufInRange == bufPtrOutofRange" << std::endl;

output:

standard guarantees no reallocation
so bufPtr0 == bufPtr1
and 200 < buf.size(), standard guarantees bufInRange == bufPtr1 + 200
finally we have: bufInRange == bufPtrOutofRange

And here 200 can be replaced with any i such that buf.size() <= i < buf.capacity(), and a similar deduction holds.

updated 2:

Yes, I did miss something... But the problem is not contiguity (see update 1), nor even a failure to write to memory (see my answer). Today I had some time to look into the problem: the program got the right address and wrote the right data into the reserved memory, but on the next reserve(), buf was reallocated with ONLY the elements in the range [0, buf.size()) copied to the new memory. So this is the answer to the whole riddle...

Final note: If you don't need reallocation after your buffer is filled with data, you can definitely use reserve()/capacity() instead of resize()/size(), but if you do, use the latter. Also, under all implementations available here (VC++, g++, ICC), the following example works as expected:

const size_t initSize = 1;
std::vector<char> buf(initSize);
buf.reserve(1024*100); // assume the reserved space is enough for file reading
std::ifstream ifile("D:\\Pictures\\input.jpg", std::ios_base::in|std::ios_base::binary);
if (ifile)
{
    ifile.read(&buf[0], buf.capacity());  // ok. the whole file is read into buf
    std::ofstream ofile("rebuld.jpg", std::ios_base::out|std::ios_base::binary);
    if (ofile)
    {
        ofile.write(&buf[0], ifile.gcount()); // rebuld.jpg just identical to input.jpg
    }
}
buf.reserve(1024*200); // horror! probably always lose all data in buf after offset initSize

And here's another example, quoted from 'TC++PL, 4e' pp 1041, note that the first line in the function uses reserve() rather than resize():

void fill(istream& in, string& s, int max)
// use s as target for low-level input (simplified)
{
    s.reserve(max); // make sure there is enough allocated space
    in.read(&s[0],max);
    const int n = in.gcount(); // number of characters read
    s.resize(n);
    s.shrink_to_fit();  // discard excess capacity
}

Update 3 (after 8 years): Many things happened during these years: I did not use C++ as my working language for nearly 6 years, and now I am a PhD student! Also, though many think there is UB, the reasons they gave are quite different (and some were already shown not to be UB), indicating this is a complex case. So, before casting votes and writing answers, it is highly recommended to read and take part in the comments.

Another thing is that, with the PhD training, I can now dive into the C++ standard with relative ease, which I dared not years ago. I believe I showed in my own answer that, based on the standard, the above two code blocks should work. (The string example requires C++11.) Since my answer is still contentious (but not falsified, I believe), I do not accept it, but rather am open to critical reviews and other answers.

Upvotes: 11

Views: 6691

Answers (2)

Mark Ransom

Reputation: 308138

reserve doesn't actually add the space to the vector, it only makes sure that you won't need a reallocation when you resize it. Instead of using reserve you should use resize, then do a final resize once you know how many bytes you actually read in.

All that reserve is guaranteed to do is prevent the invalidation of iterators and pointers as you increase the size of the vector up to capacity(). It is not guaranteed to maintain the contents of those reserved bytes unless they're part of the size().

For example, it's common for code built with a Debug flag to include extra features to make it easier to find bugs. Maybe newly allocated memory will be filled with a well defined pattern. And maybe the class will periodically scan that memory to see if it's changed, and throw an exception if it has under the assumption that only a bug could have caused that change. Such an implementation would still be standard conforming.

The example of std::string is even better, because there's a case that's almost guaranteed to fail. string::c_str() will return a pointer to the string with a null terminator character at the end. Now a conforming implementation could allocate a second buffer with room for the terminating null and return that pointer after copying the string, but that would be very wasteful. Much more likely is that the string class will just make sure its reserved buffer has room for the extra null character and write a null there as necessary. But the standard doesn't dictate when that null will be written, it could be in the call to c_str or it could be at any point where the string might be modified. So you have no way of knowing when one of your bytes is going to be overwritten.

If you really want a buffer of uninitialized bytes, std::vector<char> is probably the wrong tool anyway. You should look at a smart pointer such as std::unique_ptr<char[]> instead (the array form, so that delete[] is used on destruction).

Upvotes: 5

wpzdm

Reputation: 134

The bold texts in the answer are my main claims. I have given due effort and care by quoting/referring to the standard, but I am open to the possibility that my reading/understanding would have gaps/errors.

I read the C++03 standard because it is shorter and easier, and I believe the relevant parts are in essence the same in the newest standard. In short, there is no UB in the last two code blocks of the question, because the reserve()ed memory consists of well-behaved objects, and the effects of vector operations on those objects are defined by the standard.

It was shown in Update 1 of the question that reserve() allocates contiguous memory and that, as long as no reallocation occurs, we can get the right addresses into it. (I can provide the respective standard texts if needed.) The more dubious part is whether the allocated memory can be accessed as in the question (basically, whether we can safely read/write the memory). Let us go into this.

First, the memory is not in some "scratch space". reserve() uses vector's allocator to allocate memory. And the allocator uses operator new (standard 20.4.1.1), which in turn calls an allocation function (18.4.1.1). Thus the storage duration is until a deallocation (e.g., delete) is called on the memory (3.7.3). There would be a concern about lifetime, but this is in fact no problem for us (see below).

Second, is it really as Mark said "nothing is done with them yet - no objects have been constructed there"? First of all, what is an object? (1.8) "An object is a region of storage," that "has a storage duration (3.7) which influences its lifetime (3.8)" and also a type (3.9). Importantly for us, "an object is created by [...] a new-expression". Thus, instead of "nothing is done", we should say an object (here of type char) is created using the allocator! (Of course, the object is not initialized, but this is no problem for us.) Also important for us, because char is POD, the lifetime of the allocated object starts as soon as the storage is obtained (3.8 1). For any POD object, we can memcpy from and back into it, and the value stored there remains the same, even if the value is invalid for the type (e.g., uninitialized garbage)! (3.9 2). Thus, we have the right to read/write the memory (as char objects). Moreover, we can use other defined operations of the type (say "="), because the object is in the lifetime.

In general, we can use POD vectors like buffers as suggested in the last part of the question. Particularly, accessing reserve()ed memory of POD vectors out of size() is well-defined. Precisely, we can access the memory pointed by &vec[m] + n, where m < size() and m+n < capacity() (but &vec[m+n] is UB!).

Keeping in mind that we still have the old size(), we can even reason about the defined behavior of vector methods. For example, the memory beyond size() will not be copied after a reallocation triggered by reserve(). Because reserve() only allocates (or reallocates) uninitialized memory, the container only needs to copy the objects within size() into the reallocated memory, and beyond size() the memory should remain uninitialized.

PS: The last example is from TC++PL, 4th ed., and should work only in C++11 and above. In C++11 and above the memory of a string is contiguous, but not in earlier versions (see: Does "&s[0]" point to contiguous characters in a std::string?).

Edit: Mark made a good point in the comment: even if we can access the reserve()ed memory, would it be written by the vector out of our control? I believe not. Every operation (method, algorithm) on a container has a standard-defined effect, by a specialized "Effects" paragraph, or by overall requirements (23.1). So, if an operation has an effect on reserve()ed memory, the standard should specify it.

For example, the effect of erase(q1, q2) is "erases the elements in the range [q1, q2)" (23.1.1) and "Invalidates iterators and references at or after the point of the erase" (23.2.4.4). Thus, erase() has no effect on reserve()ed memory.

On the other hand, we know insert() has an effect on reserve()ed memory, but this can be reasoned, and in this sense, we are in control. There is nowhere in the standard that says any container operation has the effect that "could periodically wipe out anything beyond [size()]", so it should not do it!

Upvotes: -1
