user187676

Optimize emulated flatbuffer dictionary

My flatbuffers schema file dict.fbs looks like this:

namespace fbs;

table Dict {
    entries:[DictEntry];
}

table DictEntry {
    key:string (key);
    value:string;
}

root_type Dict;

Now, according to the documentation, you can emulate a dictionary in FlatBuffers with a sorted vector and a binary search lookup, like this:

#include <fstream>
#include <string>
#include <vector>

#include "dict_generated.h"  // generated by flatc from dict.fbs

using namespace fbs;

flatbuffers::FlatBufferBuilder builder(1024);

std::string key, value;
std::ifstream infile(argv[1]);
std::string outfile(argv[2]);

std::vector<flatbuffers::Offset<DictEntry>> entries;

// Each consecutive pair of lines in the input file becomes one DictEntry.
while (std::getline(infile, key) && std::getline(infile, value)) {
    entries.push_back(CreateDictEntryDirect(builder, key.c_str(), value.c_str()));
}

// CreateVectorOfSortedTables sorts the entries by the field marked (key),
// so a binary search (LookupByKey) works on the result.
auto vec = builder.CreateVectorOfSortedTables(&entries);
auto dict = CreateDict(builder, vec);

builder.Finish(dict);
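
The lookup side is not shown above; as a rough sketch (assuming the flatc-generated header is dict_generated.h and buffer_ptr is a placeholder for the loaded bytes), it would look something like:

// Read side sketch: buffer_ptr points at the serialized buffer.
const fbs::Dict *root = fbs::GetDict(buffer_ptr);

// LookupByKey does the binary search over the vector that was built
// with CreateVectorOfSortedTables above.
const fbs::DictEntry *entry = root->entries()->LookupByKey("somekey");
if (entry) {
    const char *value = entry->value()->c_str();
    // use value ...
}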

My original word list is 32 MB on disk. For each word in this list I have a normalized key and a corresponding value. It would seem logical for the serialized FlatBuffers dict to be about twice that size on disk, say 64 MB, but in reality the output is 111 MB.

Can I optimize this schema to be more compact? What blows the output up to almost four times the original size?

Upvotes: 2

Views: 1644

Answers (1)

Aardappel

Reputation: 6074

Are the strings relatively small? What is the average length?

Your overhead per entry will be: 2 strings, each with a 32-bit length field (plus a null terminator and possible padding); then 12 bytes per DictEntry table (vtable offset + 2 string offsets); then another 32-bit offset to that table in the vector. So yes, if the strings are small, it is quite possible for the size to blow up that much.
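
To put rough numbers on it (purely illustrative, assuming 6-character keys and values; FlatBuffers strings are padded to 4-byte alignment):

    input per pair:   2 x (6 chars + 1 newline)                      = 14 bytes
    output strings:   2 x (4 length + 6 chars + 1 nul, padded to 12) = 24 bytes
    output table:     4 (vtable offset) + 2 x 4 (string offsets)     = 12 bytes
    vector slot:      4 (offset to the table)                        =  4 bytes
    output per pair:                                                  ~40 bytes

That is already close to a 3x blow-up before counting the vtable and the file header, so the fixed per-entry cost dominates whenever the strings themselves are short.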

Note that if you used a std::map<std::string, std::string> you'd probably end up using even more memory.

I recommend you try the same thing with FlexBuffers (https://google.github.io/flatbuffers/flexbuffers.html), which has a more compact string representation, and for your purpose should be the same speed (since your data is "stringly typed" anyway).
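
A rough, untested sketch of what the FlexBuffers version could look like (this assumes the C++ API from flatbuffers/flexbuffers.h and a placeholder input file name; a FlexBuffers map keeps its keys sorted, so lookups are still a binary search):

#include <fstream>
#include <string>

#include "flatbuffers/flexbuffers.h"

// Build: every key/value pair becomes an entry in a single FlexBuffers map,
// with no per-entry table or vtable offsets.
flexbuffers::Builder fbb;
std::string key, value;
std::ifstream infile("wordlist.txt");  // placeholder input file

fbb.Map([&]() {
    while (std::getline(infile, key) && std::getline(infile, value)) {
        fbb.String(key.c_str(), value.c_str());
    }
});
fbb.Finish();
const std::vector<uint8_t> &buf = fbb.GetBuffer();

// Read back: map keys are stored sorted, so operator[] does a binary search.
auto dict = flexbuffers::GetRoot(buf).AsMap();
std::string v = dict["somekey"].AsString().str();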

Upvotes: 1
