LollyPop Lolly

Reputation: 165

Rust serde get runtime heap size of Vec<serde_json::Value>

I'm making a Rust tool that migrates data using the REST API of an internal service. Essentially, it makes a GET request, deserializes the returned array of JSON objects into a struct field of type serde_json::Value, gets a mutable reference to the array (as_array_mut) for a bit of processing, and POSTs the result to another REST API.

This is done in batches of, say, 10000 records per request; however, the data can change unpredictably in size. Usually it's around 10 MiB, but sometimes it can jump to over 400 MiB, which can easily crash the internal service.

Because of this, I want a way to control how many records are fetched per request based on the size of the response data, in other words, the runtime heap size of the Vec<serde_json::Value>. I've tried std::mem::size_of_val and the heapsize crate, but neither worked. One workaround would be to convert the value to a string and take its length (the size doesn't have to be 100% accurate; a rough estimate is fine), but that would mean keeping two copies of the JSON data in memory. That's my last resort, so I wanted to know whether there's another, more efficient way to get the heap size.

Update (response to @Caesar): I was temporarily using this while waiting for any better approaches: let size = serde_json::to_vec(docs)?.len();

Thanks to @Caesar I did some benchmarking, and here's what I got. size is from what I mentioned above; size_new and size_new_for are from Caesar's answer, the difference being that the first uses .map(|v| sizeof_val(v)).sum() and the second is a simple for-in loop that adds each result to a variable.

rows: 1000
size raw: 1360727, fmt: 1.30 MiB, took: 4.980794ms
size_new raw: 3834194, fmt: 3.66 MiB, took: 716.486µs
size_new_for raw: 3834194, fmt: 3.66 MiB, took: 672.523µs

rows: 10000
size raw: 17778816, fmt: 16.96 MiB, took: 62.151661ms
size_new raw: 43805986, fmt: 41.78 MiB, took: 8.775323ms
size_new_for raw: 43805986, fmt: 41.78 MiB, took: 8.158837ms

rows: 50000
size raw: 84354219, fmt: 80.45 MiB, took: 199.82163ms
size_new raw: 175919470, fmt: 167.77 MiB, took: 26.010926ms
size_new_for raw: 175919470, fmt: 167.77 MiB, took: 27.084353ms

Ignoring the timings, there's a huge difference in size compared to turning the entire thing into a vector of bytes (serde_json::to_string takes over twice as long as serde_json::to_vec but gives the same result). I'm confused as to which one is the over-estimate here: isn't turning the entire thing into a string/byte array supposed to be the over-estimate, or have I been using a grossly under-estimated approximation this whole time?

Here's the complete code:

// Estimate via serialization: length of the serialized byte vector.
let size = serde_json::to_vec(docs)?.len() as u64;
// Estimate via sizeof_val, iterator version.
let size_new: usize = docs.iter().map(|v| sizeof_val(v)).sum();
// Estimate via sizeof_val, plain for-in loop.
let mut size_new_for = 0;
for v in docs.iter() {
    size_new_for += sizeof_val(v);
}
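For context, here's roughly how I intend to use whichever estimate wins: measure bytes-per-record in the batch just fetched and size the next request to fit a byte budget. This is only a sketch with hypothetical names, not the actual tool code:

```rust
// Sketch: derive the next batch size from the measured bytes-per-record
// of the previous batch, so a sudden jump in record size shrinks the
// next request. The bounds and budget are made-up placeholders.
fn next_batch_size(prev_records: usize, prev_bytes: usize, target_bytes: usize) -> usize {
    // Average size of one record in the previous batch (guard against /0).
    let bytes_per_record = (prev_bytes / prev_records.max(1)).max(1);
    // How many records of that size fit the budget, clamped to a sane range.
    (target_bytes / bytes_per_record).clamp(100, 10_000)
}

fn main() {
    // A 10 000-record batch came back as ~400 MiB; with a 10 MiB budget
    // the next request should ask for far fewer records.
    let n = next_batch_size(10_000, 400 * 1024 * 1024, 10 * 1024 * 1024);
    println!("next batch: {n} records"); // 250 with these numbers
}
```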

Upvotes: 0

Views: 882

Answers (1)

Caesar

Reputation: 8544

Calculating the exact memory size of a serde_json::Value is somewhat tricky, for several reasons:

  • You can't access the underlying Map type and ask what capacity its backing allocation has.
  • Allocators have overhead, so even if you know the allocated size, that doesn't translate directly into how much memory you'll need.

In any case, the following function might provide a workable approximation.

fn sizeof_val(v: &serde_json::Value) -> usize {
    std::mem::size_of::<serde_json::Value>()
        + match v {
            serde_json::Value::Null => 0,
            serde_json::Value::Bool(_) => 0,
            serde_json::Value::Number(_) => 0, // Incorrect if arbitrary_precision is enabled. oh well
            serde_json::Value::String(s) => s.capacity(),
            serde_json::Value::Array(a) =>
                // .sum() needs the type annotation here; note that each
                // element's inline size is counted both by sizeof_val and by
                // the capacity term, so this errs on the high side
                a.iter().map(sizeof_val).sum::<usize>()
                    + a.capacity() * std::mem::size_of::<serde_json::Value>(),
            serde_json::Value::Object(o) => o
                .iter()
                .map(|(k, v)| {
                    std::mem::size_of::<String>()
                        + k.capacity()
                        + sizeof_val(v)
                        + std::mem::size_of::<usize>() * 3 // As a crude approximation, I pretend each map entry has 3 words of overhead
                })
                .sum(),
        }
}
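To see why an in-memory estimate like this can come out larger than the serialized length: every enum value occupies at least the size of its largest variant, no matter how compact its JSON text is. A minimal std-only illustration (ToyValue is a simplified stand-in, not serde_json's actual definition):

```rust
// A toy enum with roughly the shape of a JSON value. The String and Vec
// variants each carry a 24-byte inline header on 64-bit targets, so with
// a discriminant and padding the whole enum is typically 32 bytes.
enum ToyValue {
    Null,                 // JSON text: 4 bytes ("null"), in memory: full enum size
    Bool(bool),           // JSON text: 4-5 bytes
    Number(f64),          // JSON text: often just a few digits
    String(String),       // 24-byte header plus heap data
    Array(Vec<ToyValue>), // 24-byte header plus heap data
}

fn main() {
    // Typically prints 32 on 64-bit targets, even though e.g. ToyValue::Null
    // serializes to only 4 bytes of text.
    println!("{}", std::mem::size_of::<ToyValue>());
}
```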

A few thoughts (mostly linux-centric):

  • If you need precise memory sizes, you might be better off by directly measuring your process's memory size via procfs::process::Process::myself().unwrap().status().unwrap().vmrss.unwrap() * 1024. The caveat here is that allocators tend to not give memory back to the OS that quickly, so you might over-estimate.
  • If you're using a custom allocator, you might be able to directly ask it for memory usage statistics.
  • Instead of worrying about controlling the size, you could let the OS warn you about impending memory overuse by registering an eventfd on memory.oom_control (but I think you may have to implement that yourself; I don't see a convenient crate for it). ([Edit]: I needed this elsewhere, and it turned out to be tricky.)
  • (The loupe crate also implements allocation size measuring, but I don't think it supports serde_json.)

Upvotes: 3
