Dustin Oprea

Reputation: 10246

FlatBuffers Storage/Size Guarantees

FlatBuffers specifically avoids certain encoding standardization/guarantees. Per the documentation:

(https://google.github.io/flatbuffers/flatbuffers_internals.html)

This means that two different implementations may produce different binaries from the same input values, and that is perfectly valid.

Okay, but are strings encoded on disk directly from whatever representation they were given or encoded in some other representation? Are encode operations using the same FlatBuffers version and generated code deterministic (N operations with identical parameters produce identical results)?

What about sizing? Will a reduction in the size of dynamic structures (e.g. vectors, string values being made to be shorter) produce a corresponding reduction in the size of the encoded structure?

I really don't understand how the string encoding works and I don't have time at the moment to take apart the inner code.

I created a sample definition that has a general parent->child->grandchild structure, where the parent type has a vector of the child type, and the grandchild type embeds a string as well as a struct. I wanted to exaggerate any entropy that the different types of values might introduce to the output size by including several of them. I then populated the string value in the grandchild with a five-rune string multiplied by fifty, and iteratively decremented the multiplier by one, by hand, and printed the output size of the final encoding every time:

$ go run main.go
String size: (250)
Output encoding size: (400)

$ go run main.go
String size: (245)
Output encoding size: (400)

$ go run main.go
String size: (240)
Output encoding size: (392)

$ go run main.go
String size: (235)
Output encoding size: (384)

Why does the output size of the encoding not change after I drop the first five bytes from the original string value? Why does it then shrink by eight bytes for every five bytes I drop? Since these are strings, I would not expect alignment to be a factor here.

I still have the questions above, but it seems like it might be a safe assumption (er, guarantee) that 1) the size of the encoding is stable for the same arguments, and 2) will shrink along with the reduction in the size of one or more values within it. Is this a true statement?

Thanks for saving me some time and error in not having to hack through this on my own at the moment (hopefully).

For reference, this is the definition:

namespace testformat;

struct Vector {
    field9:ulong;
    field10:ulong;
    field11:ulong;
}

table Grandchild {
    field5:ulong;
    string6:string;
    field7:ulong;
    field8:Vector;
}

table Child {
    field3:ulong;
    field4:ulong;
    grandchild:Grandchild;
}

table Parent {
    field1:ulong;
    field2:ulong;
    children:[Child];
}

root_type Parent;

This is the part of the Go code with the repeated string value that I change (at the top):

stringValue := strings.Repeat("strin", 50)
fmt.Printf("String size: (%d)\n", len(stringValue))

stringOffset := b.CreateString(stringValue)

testformat.GrandchildStart(b)

testformat.GrandchildAddField5(b, 44)
testformat.GrandchildAddString6(b, stringOffset)
testformat.GrandchildAddField7(b, 55)

vectorOffset := testformat.CreateVector(b, 11, 22, 33)
testformat.GrandchildAddField8(b, vectorOffset)

grandchildOffset := testformat.GrandchildEnd(b)

Upvotes: 1

Views: 1499

Answers (1)

Aardappel

Reputation: 6074

This contains many questions, so here are some answers:

  • strings: FlatBuffers stores these as UTF-8, so in languages where the input can already be UTF-8 (such as typically in C++), serializing this data is a straight copy without any processing. In languages that use other representations (such as Java) there is a conversion step while serializing.
  • Sizing: yes, reducing the size of a vector typically reduces the size of the buffer, but not necessarily equally. For example, if you have a vector of shorts, reducing size by 1 element may reduce the buffer by 0 or 4, depending on alignment.
  • In your experiment, you're again seeing the effect of alignment and padding. Everything is aligned to its own size (which is how pretty much all programming languages store data in memory under the hood). So if your buffer contains a double or long or any other 8-byte value, the size of the buffer will only ever change in 8-byte increments. A string's length field is 4-byte aligned, but the actual string data is only 1-byte aligned, so it is possible to add characters to a string without changing the buffer size, since the new characters simply use what were previously padding bytes. Even though string bytes don't need alignment, adjacent data may.
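  • Your assumption 1 (the size is stable for the same arguments): only if the fields are written in the same order and by the same language/implementation. Some implementations may change the ordering, which affects alignment and padding.
  • Your assumption 2 (the size shrinks along with the values): as shown above, a reduction in buffer size is only indirectly related to a reduction in elements. It can shrink by nothing, or by more than the bytes you removed.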
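
To make the padding arithmetic concrete, here is a minimal sketch in Go. It is a simplified model, not the builder's actual algorithm: it assumes a string occupies a 4-byte length prefix, its UTF-8 bytes, and a null terminator, and that this footprint ends up rounded to 8 bytes because the neighboring ulong fields are 8-byte aligned. The names alignUp and stringFootprint are hypothetical helpers, and the fixed overhead of 144 bytes is not derived from the format; it is simply fitted to the sizes reported in the question.

```go
package main

import "fmt"

// alignUp rounds n up to the next multiple of align (align must be > 0).
func alignUp(n, align int) int {
	return (n + align - 1) / align * align
}

// stringFootprint models the space a string takes in the buffer:
// 4-byte length prefix + UTF-8 bytes + 1 null terminator, rounded up
// to align. Assumption: align is 8 here because adjacent ulong data
// forces 8-byte boundaries; this is a model, not the real builder.
func stringFootprint(byteLen, align int) int {
	return alignUp(4+byteLen+1, align)
}

func main() {
	// Assumption: everything except the string is a constant 144 bytes,
	// a value fitted to the question's output (400 - 256), not computed.
	const fixedOverhead = 144

	for _, n := range []int{250, 245, 240, 235} {
		fp := stringFootprint(n, 8)
		fmt.Printf("string bytes: %d, modeled footprint: %d, modeled total: %d\n",
			n, fp, fp+fixedOverhead)
	}
}
```

Under these assumptions the first five bytes removed are absorbed by padding (255 and 250 both round up to 256), and after that each five-byte cut crosses one 8-byte boundary, which is consistent with the 400, 400, 392, 384 sequence in the question. With a different field order or a different implementation the boundaries could fall elsewhere.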

Upvotes: 2
