J Kiefer

Reputation: 37

How does one properly deserialize a byte array back into an object in C++?

My team has been having this issue for a few weeks now, and we're a bit stumped. Kindness and knowledge would be gracefully received!

Working with an embedded system, we are attempting to serialize an object, send it through a Linux socket, receive it in another process, and deserialize it back into the original object. We have the following deserialization function:

/*! Takes a byte array and populates the object's data members */
std::shared_ptr<Foo> Foo::unmarshal(uint8_t *serialized, uint32_t size)
{
  auto msg = reinterpret_cast<Foo *>(serialized);
  return std::shared_ptr<ChildOfFoo>(
        reinterpret_cast<ChildOfFoo *>(serialized));
}

The object is successfully deserialized and can be read from. However, when the destructor for the returned std::shared_ptr<Foo> is called, the program segfaults. Valgrind gives the following output:

==1664== Process terminating with default action of signal 11 (SIGSEGV)
==1664==  Bad permissions for mapped region at address 0xFFFF603800003C88
==1664==    at 0xFFFF603800003C88: ???
==1664==    by 0x42C7C3: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:149)
==1664==    by 0x42BC00: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (shared_ptr_base.h:666)
==1664==    by 0x435999: std::__shared_ptr<ChildOfFoo, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (shared_ptr_base.h:914)
==1664==    by 0x4359B3: std::shared_ptr<ChildOfFoo>::~shared_ptr() (shared_ptr.h:93)

We're open to any suggestions at all! Thank you for your time :)

Upvotes: 1

Views: 4339

Answers (2)

Jeremy Friesner

Reputation: 73081

In general, this won't work:

auto msg = reinterpret_cast<Foo *>(serialized);

You can't just take an arbitrary array of bytes and pretend it's a valid C++ object (even if reinterpret_cast<> allows you to compile code that attempts to do so). For one thing, any C++ object that contains at least one virtual method will contain a vtable pointer, which points to the virtual-methods table for that object's class, and is used whenever a virtual method is called. But if you serialize that pointer on computer A, then send it across the network, deserialize it, and try to use the reconstituted object on computer B, you'll invoke undefined behavior, because there is no guarantee that that class's vtable will exist at the same memory location on computer B that it did on computer A.

Also, any class that does any kind of dynamic memory allocation (e.g. any string class or container class) will contain pointers to other objects that it allocated, and that will lead you to the same invalid-pointer problem.
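If you want a quick compile-time check of whether a type is even a candidate for raw-byte copying, std::is_trivially_copyable (C++11, provided your standard library implements it) expresses exactly that. Here is a minimal sketch; the struct names are made up purely for illustration:

#include <cstdint>
#include <string>
#include <type_traits>

struct Plain      { int32_t a; int32_t b; };          // no hidden pointers
struct HasVtable  { virtual void f() {} int32_t a; }; // carries a vtable pointer
struct HasHeapPtr { std::string name; };              // std::string owns heap memory

static_assert(std::is_trivially_copyable<Plain>::value,
              "raw-byte copies of Plain are at least well-defined");
static_assert(!std::is_trivially_copyable<HasVtable>::value,
              "raw bytes would include a vtable pointer valid only in this process");
static_assert(!std::is_trivially_copyable<HasHeapPtr>::value,
              "raw bytes would include a pointer to memory owned by this process");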

But let's say you've limited your serializations to only POD (Plain Old Data) objects that contain no pointers. Will it work then? The answer is: possibly, in very specific cases, but it will be very fragile. The reason is that the compiler is free to lay out the class's member variables in memory in different ways, and it will insert padding differently on different hardware (or even with different optimization settings, sometimes), leading to a situation where the bytes that represent a particular Foo object on computer A are different from the bytes that would represent that same object on computer B. On top of that, you may have to worry about different word lengths on different computers (e.g. long is 32-bit on some architectures and 64-bit on others), and different endian-ness (e.g. Intel CPUs represent values in little-endian form while PowerPC CPUs typically represent them in big-endian form). Any one of these differences will cause your receiving computer to misinterpret the bytes it received and thereby corrupt your data badly.
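To see the layout problem concretely, here is a small self-contained sketch (the Sample struct is hypothetical, not taken from your code):

#include <cstdio>
#include <cstdint>

struct Sample
{
    uint8_t  tag;    // 1 byte
    uint32_t value;  // usually preceded by 3 padding bytes for alignment
};

int main()
{
    // Commonly prints 8 rather than 5; the amount and placement of padding can
    // change with a different compiler, different flags, or #pragma pack.
    std::printf("sizeof(Sample) = %zu\n", sizeof(Sample));

    // Typically 4 on 32-bit targets (and 64-bit Windows), 8 on 64-bit Linux.
    std::printf("sizeof(long)   = %zu\n", sizeof(long));
    return 0;
}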

So the remaining part of the question is, what is the proper way to serialize/deserialize a C++ object? And the answer is: you have to do it the hard way, by writing a routine for each class that does the serialization member-variable by member-variable, taking the class's particular semantics into account. For example, here are some methods that you might have your serializable classes define:

// Serialize this object's state out into (buffer)
// (buffer) must point to at least FlattenedSize() bytes of writeable space
void Flatten(uint8_t *buffer) const;

// Return the number of bytes this object will require to serialize
size_t FlattenedSize() const;

// Set this object's state from the bytes in (buffer)
// Returns true on success, or false on failure
bool Unflatten(const uint8_t *buffer, size_t size);

... and here's an example of a simple x/y point class that implements the methods:

#include <cstdint>      // int32_t
#include <cstring>      // memcpy
#include <arpa/inet.h>  // htonl() / ntohl()

class Point
{
public:
    Point() : m_x(0), m_y(0) {/* empty */}
    Point(int32_t x, int32_t y) : m_x(x), m_y(y) {/* empty */}

    void Flatten(uint8_t *buffer) const
    {
       const int32_t beX = htonl(m_x);
       memcpy(buffer, &beX, sizeof(beX));
       buffer += sizeof(beX);
       
       const int32_t beY = htonl(m_y);
       memcpy(buffer, &beY, sizeof(beY));
    }

    size_t FlattenedSize() const {return sizeof(m_x) + sizeof(m_y);}

    bool Unflatten(const uint8_t *buffer, size_t size)
    {
       if (size < FlattenedSize()) return false;

       int32_t beX;
       memcpy(&beX, buffer, sizeof(beX));
       m_x = ntohl(beX);

       buffer += sizeof(beX);
       int32_t beY;
       memcpy(&beY, buffer, sizeof(beY));
       m_y = ntohl(beY);

       return true;
    }

    int32_t m_x;
    int32_t m_y;
};

... then your unmarshal function could look like this (note I've made it templated so that it will work for any class that implements the above methods):

/*! Takes a byte array and populates the object's data members */
template<class T> std::shared_ptr<T> unmarshal(const uint8_t *serialized, size_t size)
{
    auto sp = std::make_shared<T>();
    if (sp->Unflatten(serialized, size) == true) return sp;
 
    // Oops, Unflatten() failed!  handle the error somehow here
    [...]
}
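For completeness, here's a usage sketch; the socket I/O is omitted and the variable names are just placeholders:

#include <memory>   // std::shared_ptr
#include <vector>

void example()
{
    Point original(42, -7);

    // Sender side: flatten into a byte buffer sized by the object itself
    std::vector<uint8_t> buffer(original.FlattenedSize());
    original.Flatten(buffer.data());

    // ... write buffer to the socket, read it back in the receiving process ...

    // Receiver side: reconstruct the object from the received bytes
    std::shared_ptr<Point> received = unmarshal<Point>(buffer.data(), buffer.size());
    // received->m_x == 42 and received->m_y == -7, regardless of CPU endian-ness
}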

If this seems like a lot of work compared to just grabbing the raw memory bytes of your class object and sending them verbatim across the wire, you're right -- it is. But this is what you have to do if you want the serialization to work reliably and not break every time you upgrade your compiler, or change your optimization flags, or want to communicate between computers with different CPU architectures. If you'd rather not do this sort of thing by hand, there are pre-packaged libraries that (partially) automate the process, such as Google's Protocol Buffers library, or even good old XML.

Upvotes: 6

MNS

Reputation: 1394

The segfault during destruction occurs because you are creating the shared_ptr object by reinterpret-casting a pointer into the uint8_t buffer. When the returned shared_ptr object is destroyed, it tries to delete that uint8_t buffer as if it were a heap-allocated ChildOfFoo, and hence the segfault occurs.

Update your unmarshal as given below and try it.

std::shared_ptr<Foo> Foo::unmarshal(uint8_t *&serialized, uint32_t size)
{    
    ChildOfFoo* ptrChildOfFoo = new ChildOfFoo();
    memcpy(ptrChildOfFoo, serialized, size);

    return std::shared_ptr<ChildOfFoo>(ptrChildOfFoo);
}

Here the ownership of the ChildOfFoo object created by the statement ChildOfFoo* ptrChildOfFoo = new ChildOfFoo(); is transferred to the shared_ptr object returned by the unmarshal function. So when the returned shared_ptr object's destructor is called, the object will be properly de-allocated and no segfault occurs.
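A caller-side sketch of what that means in practice (the buffer and size parameters are placeholders for whatever you read from the socket):

void receive(uint8_t *buffer, uint32_t size)
{
    std::shared_ptr<Foo> foo = Foo::unmarshal(buffer, size);
    // ... use the object through foo ...
}   // foo's destructor runs here and deletes the heap-allocated ChildOfFoo: no segfault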

Hope this helps!

Upvotes: 0
