How to perform flexible serialization of a polymorphic inheritance hierarchy?

Question

I have tried to read carefully all the advice given in the C++ FAQ on this subject. I implemented my system according to item 36.8, and now, after a few months (with a lot of data serialized), I want to change both the public interface of some of the classes and the inheritance structure itself.

class Base
{
public:
   Vector field1() const;
   Vector field2() const; 
   Vector field3() const;
   std::string name() const {return "Base";}
};

class Derived : public Base
{
public:
    std::string name() const {return "Derived";}
};

I would like to know how to make changes such as:

Split Derived into Derived1 and Derived2, while mapping the original Derived into Derived1 for existing data.
Split Base::field1() into Base::field1a() and Base::field1b() while mapping field1 to field1a and having field1b empty for existing data.

I will have to

deserialize all the gigabytes of my old data
convert them to the new inheritance structure
reserialize them in a new and more flexible way.

I would like to know how to make the serialization more flexible, so that when I decide to make some change in the future, I would not be facing conversion hell like now.

I thought of making a system that would use numbers instead of names to serialize my objects: for example, Base = 1, Derived1 = 2, and so on, plus a separate number-to-name table that converts numbers to names. When I want to rename a class, I would change only this table, without touching the data.

The problems with this approach are:

The system would be brittle. That is, changing anything in the number-to-name system could change the meaning of gigabytes of data.
The serialized data would lose some of its human readability, since in the serialized data, there would be numbers instead of names.

I am sorry for putting so many issues into one question, but I am inexperienced at programming and the problem I am facing seems so overwhelming that I just do not know where to start.

Any general materials, tutorials, idioms or literature on flexible serialization are most welcome!

James Kanze · Accepted Answer

It's probably a bit late for that now, but whenever you design a serialization format, you should provide for versioning. The version can be mangled into the type information in the stream, or treated as a separate (integer) field. When writing the class out, you always write the latest version. When reading, you have to read both the type and the version before you can construct the object; if you're using the static map suggested in the FAQ, then the key would be:

struct DeserializeKey
{
    std::string type;
    int version;
};

Given the situation you are in now, the solution is probably to mangle the version into the type name in a clearly recognizable way, say something along the lines of type_name__version; if the type_name isn't followed by two underscores, then use version 0. This isn't the most efficient method, but it's usually acceptable, and it will solve the problem of backwards compatibility while providing for evolution in the future.
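
As an illustration of the type_name__version convention, the following sketch splits a name read from the stream into a type and a version. The function name `parseMangledName` is hypothetical, and the sketch assumes that no existing type name already ends in two underscores followed by digits; names without such a suffix are treated as version 0, which covers all the data written so far.

```cpp
#include <cstddef>
#include <string>

struct DeserializeKey {
    std::string type;
    int version;
};

// Splits "type_name__version" at the last "__".
// Anything without a purely numeric suffix is taken as version 0.
DeserializeKey parseMangledName(const std::string& mangled) {
    const std::string sep = "__";
    const std::size_t pos = mangled.rfind(sep);
    if (pos == std::string::npos || pos == 0 ||
        pos + sep.size() >= mangled.size()) {
        return {mangled, 0};  // old data: plain type name
    }
    const std::string suffix = mangled.substr(pos + sep.size());
    if (suffix.find_first_not_of("0123456789") != std::string::npos) {
        return {mangled, 0};  // "__" is part of the name, not a version marker
    }
    // std::stoi throws std::out_of_range for absurdly long digit strings.
    return {mangled.substr(0, pos), std::stoi(suffix)};
}
```

With this, `Derived` parses as type `Derived`, version 0, while `Derived1__2` parses as type `Derived1`, version 2.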

For your precise questions:

In this case, Derived is just a previous version of Derived1. You can insert the necessary factory function into the map under the appropriate key.
This is just classical versioning. Version 0 of Base has a field1 attribute; when you deserialize it, you use field1 to initialize field1a, and you initialize field1b empty. Version 1 of Base has both field1a and field1b.

If you mangle the version into the type name, as I suggest above, you shouldn't have to convert any existing data. Long term, of course, either some of the older versions simply disappear from your data sets, so that you can remove the support for them, or your program keeps getting bigger, with support for lots of older versions. In practice, I've usually seen the latter.
