Scenario
I have a file descriptor set passed in by an outside party, and I am building a system responsible for accepting it, then receiving messages from that party, deserializing them, and forwarding the result to a data store. I have to support this for both protobuf 2 and 3. I can neither control nor trust the file descriptor set or the payload.
Since my system acts as a proxy, I am hardening it so that untrusted input cannot starve it of resources (CPU and memory) and brown it out.
Two of the dimensions I am looking at are:
The reason I am pursuing 2 is that I am not sure how "efficient" protobuf's encoding is (i.e. if my serialized message takes X bytes, and my deserialized message takes Y bytes, then what is the upper bound of Y / X).
For 1 I think I have it figured out. I will update https://github.com/protocolbuffers/protobuf/blob/6c72d103fdc5edb88ecd6342a5dd8dba88f3356f/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L413 to be some value that our system can handle at load.
However, I cannot find a solution for case 2 so far. I was looking at the entry point here: https://github.com/protocolbuffers/protobuf/blob/6c72d103fdc5edb88ecd6342a5dd8dba88f3356f/java/core/src/main/java/com/google/protobuf/DynamicMessage.java#L90 but I cannot find a way to say "read up to X bytes into memory when deserializing, but if we see more, stop and throw an exception/error".
I was hoping someone could shed light on whether this is a tractable problem and whether protobuf supports it.
Upvotes: 1
Views: 216
Reputation: 198133
Implementing 2 is very hard because measuring memory in Java is hard and the amount of memory allocated, even given the descriptor, is unspecified and subject to change (though it's always pretty reasonable). To give a specific example here, protos are likely to have lists allocated for repeated fields, so you have to have the reference to the list, the list object itself, the list's reference to an array, the array itself, etc. Also, you can't really intercept or listen to proto allocations.
That said, I think you can expect to bound Y/X to a reasonable ratio. The upper bound on Y/X depends on the protos described by the descriptor: if you deserialize an empty message of a type with ten thousand fields, you'll still allocate that big message. The degenerate cases therefore come from the descriptor rather than the payload, so if you trust the author of the descriptors but not the author of the message you're deserializing, the message author can't create an extremely degenerate case.
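One way to act on that bound, as a rough sketch: pick a heap budget, divide it by a worst-case Y/X ratio you have measured for the descriptors you accept, and refuse to hand the parser more serialized bytes than that. Everything named here (`ByteBudgetInputStream`, `maxInputBytes`, the ratio itself) is hypothetical and not part of protobuf, and this caps X rather than measuring heap use directly.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper: fails once more than `budget` bytes have been consumed. */
public class ByteBudgetInputStream extends FilterInputStream {
    private long remaining; // bytes we are still willing to hand to the parser

    public ByteBudgetInputStream(InputStream in, long budget) {
        super(in);
        this.remaining = budget;
    }

    /** Serialized-size cap derived from a heap budget and an ASSUMED Y/X ratio. */
    public static long maxInputBytes(long heapBudgetBytes, long assumedExpansionRatio) {
        return heapBudgetBytes / assumedExpansionRatio;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) consume(1);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // May pull in up to one buffer past the budget before failing.
        int n = super.read(buf, off, len);
        if (n > 0) consume(n);
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) consume(skipped);
        return skipped;
    }

    private void consume(long n) throws IOException {
        remaining -= n;
        if (remaining < 0) {
            throw new IOException("payload exceeds byte budget");
        }
    }
}
```

The wrapped stream can then be passed to `DynamicMessage.parseFrom(descriptor, stream)`, and parsing aborts with an `IOException` as soon as the budget is exceeded. The ratio is something you would have to measure per descriptor; protobuf does not guarantee one.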
Separately, it's not obvious to me why you need to deserialize at all before you forward to a data store. If you're just passing the message along somewhere, that doesn't require a deserialization step.
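If forwarding really is all the proxy does, a minimal sketch would be to read the serialized payload up to a cap and pass the bytes along untouched, never building a message object. `RawForwarder`, `forward` and `maxBytes` are hypothetical names, and the `OutputStream` stands in for whatever client your data store actually exposes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class RawForwarder {
    private RawForwarder() {}

    /**
     * Copies at most maxBytes from source to sink without parsing them.
     * Nothing is written if the payload turns out to be larger than the cap.
     * Assumes maxBytes is below Integer.MAX_VALUE.
     */
    public static int forward(InputStream source, OutputStream sink, int maxBytes)
            throws IOException {
        // Read one byte past the cap so an oversized payload is detectable.
        byte[] payload = source.readNBytes(maxBytes + 1);
        if (payload.length > maxBytes) {
            throw new IOException("payload larger than " + maxBytes + " bytes");
        }
        sink.write(payload);
        return payload.length;
    }
}
```

Memory use here is bounded by the cap plus one byte, regardless of what the descriptor says, because the bytes are never expanded into objects.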
Upvotes: 0