benjist

Reputation: 2881

Cap'n Proto: Piecewise write large message to disk

I want to create a giant packed data array and persist it on disk. I'm using writePackedMessageToFd(). However, since the input data is so large (50GB), I need to write pieces of the message to disk as I go in order to free up memory.

Is this possible with the current version of Cap'n Proto?

Side note: This question differs from the mentioned duplicate question in that the output does not need to be streamed. There could theoretically be other options, such as a growing file that holds the whole (unfinished) message after a first pass, with a second pass that finishes the message.

Upvotes: 5

Views: 1075

Answers (1)

Kenton Varda

Reputation: 45246

Exactly what you describe probably won't work. When reading a packed message from disk, you must read and unpack the entire message upfront, which will require enough physical RAM to hold the whole thing unpacked.

You have two options:

  1. Break the message up into many chunks. Cap'n Proto messages are self-delimiting, so you can write several messages to a file one at a time, and then later read them back one at a time in the same order.

  2. Don't use packed format. If the message isn't packed, then you can mmap() it. Then, the operating system will read parts into memory as they are accessed, and can flush them back out of memory later as needed. In this case, reading is trivial, but writing the file initially is tricky. Presumably, the process writing the file also doesn't have space for the whole file in memory. Cap'n Proto doesn't currently support writing via mmap (writable mmap is problematic), but there is usually another trick you can do: Probably, large chunks of your message actually originate directly from some input files, i.e. the message embeds huge byte blobs from other files. In this case, you can mmap() in each of those files, and then you can incorporate them into the message using capnp::Orphanage::referenceExternalData(). This way the files don't all have to be memory-resident at the same time; the OS will page in and out each one in sequence as the final output is being written. See this answer for more details and some example code.
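To illustrate the demand-paging behaviour that option 2 relies on, here is a minimal sketch using plain POSIX mmap(). It is not Cap'n Proto code: MappedFile and sumBytes are hypothetical helpers invented for this example, and it assumes a POSIX system. In a real reader you would hand the mapped region to Cap'n Proto's flat-array reader instead of summing bytes, and in a writer you would pass mapped blobs to referenceExternalData() as described above.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: read-only memory mapping of an entire file.
// Pages are loaded lazily by the OS when they are first touched, and
// may be evicted again under memory pressure.
class MappedFile {
public:
  explicit MappedFile(const std::string& path) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed: " + path);

    struct stat st;
    if (::fstat(fd, &st) != 0) {
      ::close(fd);
      throw std::runtime_error("fstat failed: " + path);
    }
    size_ = static_cast<std::size_t>(st.st_size);

    if (size_ > 0) {  // mmap() rejects zero-length mappings
      void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
      if (p == MAP_FAILED) {
        ::close(fd);
        throw std::runtime_error("mmap failed: " + path);
      }
      data_ = static_cast<const unsigned char*>(p);
    }
    ::close(fd);  // the mapping remains valid after the descriptor is closed
  }

  ~MappedFile() {
    if (data_ != nullptr) {
      ::munmap(const_cast<unsigned char*>(data_), size_);
    }
  }

  MappedFile(const MappedFile&) = delete;
  MappedFile& operator=(const MappedFile&) = delete;

  const unsigned char* data() const { return data_; }
  std::size_t size() const { return size_; }

private:
  const unsigned char* data_ = nullptr;
  std::size_t size_ = 0;
};

// Hypothetical consumer: walks the mapping front to back. Only the pages
// currently being read need to be resident, so the file can be far larger
// than physical RAM.
std::uint64_t sumBytes(const MappedFile& file) {
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < file.size(); ++i) {
    total += file.data()[i];
  }
  return total;
}
```

Constructing a MappedFile costs almost nothing regardless of file size; the actual disk reads happen inside sumBytes as each page is touched. That is the property that lets both the unpacked reader and the blob-embedding writer work with data sets much larger than memory.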

Upvotes: 3
