Reputation: 165
I'm making a Rust tool which migrates data using the REST API of an internal service. Essentially, it makes a GET request; the response is an array of JSON objects which is deserialized into a struct field of type serde_json::Value. The tool then gets a mutable view of the array (as_array_mut) for a bit of processing and POSTs the result to another REST API.
This is done in batches of, say, 10000 records per request; however, the data can change unpredictably in size. Usually it's around 10MiB, but sometimes it can jump to over 400MiB, which can easily crash the internal service.
Because of this, I want a way to control how many records are fetched per request based on the size of the response data, in other words, the heap size of the Vec<serde_json::Value> at runtime. I've tried std::mem::size_of_val and the heapsize crate, but they didn't work. One workaround would be converting the data to a string and taking its length (the size doesn't have to be 100% accurate; a rough estimate is fine too), but that would mean holding two copies of the JSON data in memory. This is my last option, but I wanted to know if there's any other, more efficient way to get the heap size.
Update - (response to @Caesar):
I was temporarily using this while waiting for any better approaches:
let size = serde_json::to_vec(docs)?.len();
Thanks to @Caesar I did some benchmarking, and here's what I got. size is from what I mentioned above; size_new and size_new_for are from Caesar's answer, the difference being that the first uses .map(|v| { sizeof_val(v) }).sum() and the second is a simple for-in loop which adds each result to a variable.
rows: 1000
size raw: 1360727, fmt: 1.30 MiB, took: 4.980794ms
size_new raw: 3834194, fmt: 3.66 MiB, took: 716.486µs
size_new_for raw: 3834194, fmt: 3.66 MiB, took: 672.523µs
rows: 10000
size raw: 17778816, fmt: 16.96 MiB, took: 62.151661ms
size_new raw: 43805986, fmt: 41.78 MiB, took: 8.775323ms
size_new_for raw: 43805986, fmt: 41.78 MiB, took: 8.158837ms
rows: 50000
size raw: 84354219, fmt: 80.45 MiB, took: 199.82163ms
size_new raw: 175919470, fmt: 167.77 MiB, took: 26.010926ms
size_new_for raw: 175919470, fmt: 167.77 MiB, took: 27.084353ms
Ignoring the timings, there is a huge difference in size between serializing the entire thing to a vector of bytes and summing per-value estimates (serde_json::to_string takes over twice as long as serde_json::to_vec but gives the same length). I'm sort of confused as to which one is the over-estimate here: isn't serializing the entire thing to a string/byte array supposed to be an over-estimate, or have I been using a grossly under-estimating approximation this whole time?
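For what it's worth, a rough sketch of why the in-memory estimate can exceed the serialized length: every element of a Vec<serde_json::Value> occupies the full size of the Value enum (plus any heap capacity behind it), while the JSON text is compact. A stdlib-only toy enum (ToyValue is a made-up stand-in, not serde_json's actual type) shows the fixed per-element cost:

```rust
use std::mem::size_of;

// Toy enum shaped roughly like serde_json::Value: every variant occupies
// the size of the largest variant (a 3-word String/Vec header) plus the
// discriminant, even Null.
#[allow(dead_code)]
enum ToyValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<ToyValue>),
}

fn main() {
    // A String header is 3 words (ptr + len + capacity); its byte
    // contents live on the heap on top of that.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());
    // The enum is at least as large as its largest variant, so even a
    // Null element in a Vec<ToyValue> costs the full enum size, while
    // it serializes to just the 4 bytes of "null".
    assert!(size_of::<ToyValue>() >= size_of::<String>());
    println!("ToyValue: {} bytes per element", size_of::<ToyValue>());
}
```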
Here's the complete code:
let size = serde_json::to_vec(docs)?.len() as u64;

let size_new: usize = docs.iter().map(|v| sizeof_val(v)).sum();

let mut size_new_for = 0;
for v in docs.iter() {
    size_new_for += sizeof_val(v);
}
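With any of these estimates in hand, one way to control how many records are fetched per request is to rescale the batch size after each response. This is a hypothetical sketch (next_batch_size, target_size, and the 2x clamp are my own assumptions, not part of the tool above):

```rust
// Hypothetical helper: rescale the batch size so the next request's
// payload lands near `target_size` bytes, given the estimated byte size
// of the batch that was just fetched.
fn next_batch_size(current: usize, estimated_size: usize, target_size: usize) -> usize {
    let scaled = (current as u64 * target_size as u64 / estimated_size.max(1) as u64) as usize;
    // Clamp to at most a 2x change per step so one outlier response
    // can't swing the batch size violently, and never drop to zero.
    scaled.clamp(current / 2, current * 2).max(1)
}

fn main() {
    // A 40 MiB response against a 10 MiB target would quarter the batch,
    // but the 2x clamp halves it instead.
    assert_eq!(next_batch_size(10_000, 40 << 20, 10 << 20), 5_000);
    // A tiny 1 MiB response grows the batch, capped at 2x.
    assert_eq!(next_batch_size(10_000, 1 << 20, 10 << 20), 20_000);
}
```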
Upvotes: 0
Views: 882
Reputation: 8544
Calculating the exact memory size of a serde_json::Value is somewhat tricky for several reasons: much of the data lives in separate heap allocations, a String or Vec can carry spare capacity beyond its length, and you can't reach into the Map class and ask what capacity its backing allocation has.

In any case, the following function might provide a workable approximation.
fn sizeof_val(v: &serde_json::Value) -> usize {
    std::mem::size_of::<serde_json::Value>()
        + match v {
            serde_json::Value::Null => 0,
            serde_json::Value::Bool(_) => 0,
            serde_json::Value::Number(_) => 0, // Incorrect if arbitrary_precision is enabled. oh well
            serde_json::Value::String(s) => s.capacity(),
            serde_json::Value::Array(a) => {
                a.iter().map(sizeof_val).sum::<usize>()
                    + a.capacity() * std::mem::size_of::<serde_json::Value>()
            }
            serde_json::Value::Object(o) => o
                .iter()
                .map(|(k, v)| {
                    std::mem::size_of::<String>()
                        + k.capacity()
                        + sizeof_val(v)
                        + std::mem::size_of::<usize>() * 3 // As a crude approximation, I pretend each map entry has 3 words of overhead
                })
                .sum(),
        }
}
A few thoughts (mostly linux-centric):

- To measure the memory use of the whole process, you can read the resident set size with procfs::process::Process::myself().unwrap().status().unwrap().vmrss.unwrap() * 1024. The caveat here is that allocators tend to not give memory back to the OS that quickly, so you might over-estimate.
- To get notified before memory runs out, you can register an eventfd on the cgroup's memory.oom_control (but I think you may have to implement that yourself, I don't see a convenient crate for it). ([Edit]: I needed this elsewhere, and it turned out to be tricky.)
- (The loupe crate also implements allocation size measuring, but I don't think it supports serde_json.)

Upvotes: 3