Reputation: 17953
Given a Python list of bytes values:
# actual str values un-important
[
    b'foo',
    b'bar',
    b'baz',
    ...
]
How can the list be broken into chunks where each chunk's total memory size stays below a certain ceiling?
For example: if the ceiling were 7 bytes, then the original list would be broken up into a list of lists
[
    [b'foo', b'bar'],  # sublist 0
    [b'baz'],          # sublist 1
    ...
]
And each sublist would be at most 7 bytes, based on the accumulated length of its contents.
Note: each sub-list should be maximally packed, in the order of the original list. In the example above, the first two bytes values were grouped because that is the maximum possible under the 7-byte limit.
Thank you in advance for your consideration and response.
Upvotes: 7
Views: 2777
Reputation: 509
from sys import getsizeof
import math


def chunkify_list(L, max_size_kb):
    chunk_size_elements = int(math.ceil(len(L) / int(math.ceil(getsizeof(L) / (1024 * max_size_kb)))))
    return [L[x: x + chunk_size_elements] for x in range(0, len(L), chunk_size_elements)]
I wrote this code and it works for me. It requires the math module.
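A hypothetical usage sketch (the sample data is made up; note that the ceiling is given in kilobytes and the number of chunks is derived from getsizeof() of the list object itself):
data = [b'foo', b'bar', b'baz'] * 1000

# Split into equal-length chunks; the chunk count comes from getsizeof()
# of the list and the 1 KB ceiling:
chunks = chunkify_list(data, max_size_kb=1)
print(len(chunks), len(chunks[0]))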
Upvotes: 0
Reputation: 26896
The problem of optimally splitting a sequence so that its elements satisfy a given max/min condition while keeping their order can be solved greedily. Hence, you need to iterate over the input sequence only once and maintain a buffer of elements. In Python this can be elegantly coded with a generator, which has the advantage of not needing to build the whole result in memory.
The bulk of the algorithm for your problem is as follows:
def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer
where the last parameter delegates determining the size of a given item to the specified callable. I will not dwell on this, but I will assume that a simple len() will do.
Also, this assumes that each element individually satisfies the condition; otherwise, that case should be handled as well.
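For instance, a minimal sketch of swapping in a different size function (here sys.getsizeof(), to account for the full object overhead rather than just the payload length; the sample items are made up):
from sys import getsizeof

items = [b'foo', b'bar', b'baz']

# Count the full object size (bytes-object overhead included)
# instead of just the payload length:
chunks = list(split_by_size(items, 100, get_size=getsizeof))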
Testing the above code:
import random
k = 10
n = 15
max_size = 10
random.seed(0)
items = [b'x' * random.randint(1, 2 * k // 3) for _ in range(n)]
print(items)
# [b'xxxx', b'xxxx', b'x', b'xxx', b'xxxxx', b'xxxx', b'xxxx', b'xxx', b'xxxx', b'xxx', b'xxxxx', b'xx', b'xxxxx', b'xx', b'xxx']
print(list(split_by_size(items, k)))
# [[b'xxxx', b'xxxx', b'x'], [b'xxx', b'xxxxx'], [b'xxxx', b'xxxx'], [b'xxx', b'xxxx', b'xxx'], [b'xxxxx', b'xx'], [b'xxxxx', b'xx', b'xxx']]
Also, if you are willing to store the result of the split in a list anyway, the code for the above approach can be made slightly more compact:
def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result
but also slightly slower (see benchmarks below).
You could also think of using functools.reduce() (basically the same as @NizamMohamed's answer), and the code will be shorter but perhaps also less readable:
import functools


def chunks_by_size_reduce(items, size, get_size=len):
    return functools.reduce(
        lambda a, b, size=size:
            a[-1].append(b) or a
            if a and sum(get_size(x) for x in a[-1]) + get_size(b) <= size
            else a.append([b]) or a,
        items, [])
and certainly less efficient, as get_size() is called for every element of the "candidate" inner list for every element considered, which makes this O(n k), with k being the average number of elements in each sub-sequence. For some timings, see the benchmarks below.
I would not be surprised to see a solution using itertools.accumulate(), but that would also be bound to be quite slow.
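For illustration only, a speculative sketch of what such an accumulate()-based variant might look like (this is not from the original answer; it assumes Python 3.8+ for the initial keyword, and the name chunks_by_size_accumulate is made up):
import itertools


def chunks_by_size_accumulate(items, max_size, get_size=len):
    # Tag each item with a (running_total, chunk_index) state; the total
    # resets and the index advances whenever the ceiling would be exceeded.
    def step(state, item):
        total, index = state
        size = get_size(item)
        if total + size <= max_size:
            return total + size, index
        return size, index + 1

    states = itertools.accumulate(items, step, initial=(0, 0))
    next(states)  # discard the initial state
    grouped = itertools.groupby(zip(items, states), key=lambda pair: pair[1][1])
    return [[item for item, _ in group] for _, group in grouped]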
The simplest approach to speed things up would be to use Cython or Numba. Here, this was applied to split_by_size(). For both of them, the code would be unchanged.
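As a rough illustration (not from the original answer), the compiled variants used in the benchmarks below could be produced along these lines:
import numba

# Hypothetical sketch: reuse the pure-Python generator unchanged and let
# Numba compile it in object mode; forceobj=True is needed because the
# items are plain Python bytes objects.
split_by_size_nb = numba.jit(forceobj=True)(split_by_size)

# The Cython variant (_cy) would likewise reuse the same source unchanged,
# e.g. compiled from a .pyx module or via the %%cython cell magic.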
Benchmarking all this we obtain (_cy stands for the Cython-compiled version, while _nb stands for the Numba-compiled version):
%timeit list(split_by_size(items * 100000, k + 1))
# 10 loops, best of 3: 281 ms per loop
%timeit list(split_by_size_cy(items * 100000, k + 1))
# 10 loops, best of 3: 181 ms per loop
%timeit list(split_by_size_nb(items * 100000, k + 1))
# 100 loops, best of 3: 5.17 ms per loop
%timeit chunks_by_size(items * 100000, k + 1)
# 10 loops, best of 3: 318 ms per loop
%timeit chunks_by_size_reduce(items * 100000, k + 1)
# 1 loop, best of 3: 1.18 s per loop
Note that while the Numba-compiled version is much faster than the alternatives, it is also the most brittle, since it requires the forceobj flag set to True, and this may lead to unstable execution.
Anyway, I hardly believe this would be a bottleneck if the final goal is to send the grouped items through some I/O operation.
Note that the algorithm is pretty much the same as other answers, I just find the code here a bit cleaner.
Upvotes: 8
Reputation: 9240
This solution uses functools.reduce.
from functools import reduce

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']

reduce(lambda a, b, size=7: a[-1].append(b) or a
       if a and sum(len(x) for x in a[-1]) + len(b) <= size
       else a.append([b]) or a, l, [])
a is the accumulator (initially an empty list) and b is an item from the original list.
if a and sum(len(x) for x in a[-1]) + len(b) <= size checks that a is not empty and that the total length of the bytes in the last appended list, plus the length of b, does not exceed size.
a[-1].append(b) or a appends b to the last appended list of a and returns a if the condition is True.
a.append([b]) or a makes a new list containing b, appends that list to a, and returns a.
Output:
[[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
Upvotes: 2
Reputation: 3237
Keeping it short and sweet:
l = [b'foo', b'bar', b'baz']
thresh = 7

out = []
cur_size = 0
for x in l:
    if len(x) > thresh:
        raise ValueError("str too big")
    if cur_size + len(x) > thresh:
        cur_size = 0
    if cur_size == 0:
        out.append([])
    out[-1].append(x)
    cur_size += len(x)

print(out)
This will output:
[[b'foo', b'bar'], [b'baz']]
That should be what you want, if I understood correctly. It's very simple: all it does is append the strings from the list while tracking the combined size of everything in the current sublist it is appending to; if that size plus the next item would exceed the threshold, it starts a new sublist.
Upvotes: 1
Reputation: 13397
The simple, naïve approach would be:
import sys
import numpy as np

# init input data - as per the comments, data type does matter,
# for memory calculation, and for the sake of example -
# string is probably the easiest case:
lts = list("abcdefghijklmnopqrstuvwxyz")
data = [{letter: "".join(np.random.choice(lts, np.random.randint(100, 700)))} for letter in lts]

# parameters setup:
threshold = 1024
buffer = []
buffer_len = 0
res_data = []

for el in data:
    # I assumed it's one key, one value per dictionary (looks like this from your question)
    len_ = sys.getsizeof(list(el.values())[0])
    if buffer_len + len_ > threshold:
        res_data.append(buffer)
        buffer = [el]
        buffer_len = len_
    else:
        buffer.append(el)
        buffer_len += len_

if buffer_len > 0:
    res_data.append(buffer)

print(res_data)
Upvotes: 1