Reputation: 17953
Given a Python list of bytes values:
# actual str values un-important
[
    b'foo',
    b'bar',
    b'baz',
    ...
]
How can the list be broken into chunks where each chunk's total memory size stays below a certain ceiling?
For example: if the ceiling were 7 bytes, then the original list would be broken up into a list of lists
[
    [b'foo', b'bar'],  # sublist 0
    [b'baz'],          # sublist 1
    ...
]
And each sublist would be at most 7 bytes, based on the accumulated length of its contents.
Note: each sub-list should be maximally packed, in the order of the original list. In the example above, the first two bytes values were grouped because that is the maximum possible under the 7-byte limit.
Thank you in advance for your consideration and response.
Upvotes: 7
Views: 2777
Reputation: 509
from sys import getsizeof
import math


def chunkify_list(L, max_size_kb):
    chunk_size_elements = int(math.ceil(len(L) / int(math.ceil(getsizeof(L) / (1024 * max_size_kb)))))
    return [L[x: x + chunk_size_elements] for x in range(0, len(L), chunk_size_elements)]
I wrote this code and it works for me. It requires the math module.
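A hypothetical usage sketch (the sample data is made up; note that the ceiling is given in kilobytes and the number of chunks is derived from getsizeof() of the list object itself):
data = [b'foo', b'bar', b'baz'] * 1000

# Split into equal-length chunks; the chunk count comes from getsizeof()
# of the list and the 1 KB ceiling:
chunks = chunkify_list(data, max_size_kb=1)
print(len(chunks), len(chunks[0]))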
Upvotes: 0
Reputation: 26896
The problem of optimally splitting a sequence so that its elements satisfy a given max/min condition while keeping their order can be solved greedily. Hence, you need to iterate over the input sequence only once and maintain a buffer of elements. In Python this can be elegantly coded with a generator, which has the advantage of not needing to build the whole result in memory.
The bulk of the algorithm for your problem is as follows:
def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer
where the last parameter delegates determining the size of a given item to the specified callable. I will not dwell on this, but I will assume that a simple len() will do.
Also, this assumes that each element individually satisfies the condition; otherwise, that case should be handled as well.
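For instance, a minimal sketch of swapping in a different size function (here sys.getsizeof(), to account for the full object overhead rather than just the payload length; the sample items are made up):
from sys import getsizeof

items = [b'foo', b'bar', b'baz']

# Count the full object size (bytes-object overhead included)
# instead of just the payload length:
chunks = list(split_by_size(items, 100, get_size=getsizeof))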
Testing the above code:
import random
k = 10
n = 15
max_size = 10
random.seed(0)
items = [b'x' * random.randint(1, 2 * k // 3) for _ in range(n)]
print(items)
# [b'xxxx', b'xxxx', b'x', b'xxx', b'xxxxx', b'xxxx', b'xxxx', b'xxx', b'xxxx', b'xxx', b'xxxxx', b'xx', b'xxxxx', b'xx', b'xxx']
print(list(split_by_size(items, k)))
# [[b'xxxx', b'xxxx', b'x'], [b'xxx', b'xxxxx'], [b'xxxx', b'xxxx'], [b'xxx', b'xxxx', b'xxx'], [b'xxxxx', b'xx'], [b'xxxxx', b'xx', b'xxx']]
Also, if you are willing to store the result of the split in a list anyway, the code for the above approach can be made slightly more compact:
def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result
but also slightly slower (see benchmarks below).
You could also think of using functools.reduce() (basically the same as @NizamMohamed's answer), and the code will be shorter but perhaps also less readable:
import functools


def chunks_by_size_reduce(items, size, get_size=len):
    return functools.reduce(
        lambda a, b, size=size:
            a[-1].append(b) or a
            if a and sum(get_size(x) for x in a[-1]) + get_size(b) <= size
            else a.append([b]) or a,
        items, [])
and certainly less efficient, as get_size() is called for every element of the "candidate" inner list for every element considered, which makes this O(n k), with k being the average number of elements in each sub-sequence. For some timings, see the benchmarks below.
I would not be surprised to see a solution using itertools.accumulate(), but that would also be bound to be quite slow.
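For illustration only, a speculative sketch of what such an accumulate()-based variant might look like (this is not from the original answer; it assumes Python 3.8+ for the initial keyword, and the name chunks_by_size_accumulate is made up):
import itertools


def chunks_by_size_accumulate(items, max_size, get_size=len):
    # Tag each item with a (running_total, chunk_index) state; the total
    # resets and the index advances whenever the ceiling would be exceeded.
    def step(state, item):
        total, index = state
        size = get_size(item)
        if total + size <= max_size:
            return total + size, index
        return size, index + 1

    states = itertools.accumulate(items, step, initial=(0, 0))
    next(states)  # discard the initial state
    grouped = itertools.groupby(zip(items, states), key=lambda pair: pair[1][1])
    return [[item for item, _ in group] for _, group in grouped]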
The simplest approach to speed things up would be to use Cython or Numba. Here, this was applied to split_by_size(). For both of them, the code would be unchanged.
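As a rough illustration (not from the original answer), the compiled variants used in the benchmarks below could be produced along these lines:
import numba

# Hypothetical sketch: reuse the pure-Python generator unchanged and let
# Numba compile it in object mode; forceobj=True is needed because the
# items are plain Python bytes objects.
split_by_size_nb = numba.jit(forceobj=True)(split_by_size)

# The Cython variant (_cy) would likewise reuse the same source unchanged,
# e.g. compiled from a .pyx module or via the %%cython cell magic.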
Benchmarking all this we obtain (_cy stands for the Cython-compiled version, while _nb stands for the Numba-compiled version):
%timeit list(split_by_size(items * 100000, k + 1))
# 10 loops, best of 3: 281 ms per loop
%timeit list(split_by_size_cy(items * 100000, k + 1))
# 10 loops, best of 3: 181 ms per loop
%timeit list(split_by_size_nb(items * 100000, k + 1))
# 100 loops, best of 3: 5.17 ms per loop
%timeit chunks_by_size(items * 100000, k + 1)
# 10 loops, best of 3: 318 ms per loop
%timeit chunks_by_size_reduce(items * 100000, k + 1)
# 1 loop, best of 3: 1.18 s per loop
Note that while the Numba-compiled version is much faster than the alternatives, it is also the most brittle, since it requires the forceobj flag set to True, and this may lead to unstable execution.
Anyway, I hardly believe this would be a bottleneck if the final goal is to send the grouped items through some I/O operation.
Note that the algorithm is pretty much the same as other answers, I just find the code here a bit cleaner.
Upvotes: 8
Reputation: 9240
This solution uses functools.reduce.
from functools import reduce

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']

reduce(lambda a, b, size=7: a[-1].append(b) or a
       if a and sum(len(x) for x in a[-1]) + len(b) <= size
       else a.append([b]) or a, l, [])
a is the accumulator (initially an empty list) and b is an item from the original list.
if a and sum(len(x) for x in a[-1]) + len(b) <= size checks that a is not empty and that the total length of the bytes in the last appended list, plus the length of b, does not exceed size.
a[-1].append(b) or a appends b to the last appended list of a and returns a if the condition is True.
a.append([b]) or a makes a new list containing b, appends that list to a, and returns a.
Output:
[[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
Upvotes: 2
Reputation: 3237
Keeping it short and sweet:
l = [b'foo', b'bar', b'baz']
thresh = 7

out = []
cur_size = 0
for x in l:
    if len(x) > thresh:
        raise ValueError("str too big")
    if cur_size + len(x) > thresh:
        cur_size = 0
    if cur_size == 0:
        out.append([])
    out[-1].append(x)
    cur_size += len(x)

print(out)
This will output:
[[b'foo', b'bar'], [b'baz']]
That should be what you want, if I understood correctly. It's very simple: all it does is append the strings from the list while tracking the combined size of everything in the current sublist it is appending to; if that size plus the next item would exceed the threshold, it starts a new sublist.
Upvotes: 1
Reputation: 13397
The simple, naïve approach would be:
import sys
import numpy as np

# init input data - as per the comments, data type does matter,
# for memory calculation, and for the sake of example -
# string is probably the easiest case:
lts = list("abcdefghijklmnopqrstuvwxyz")
data = [{letter: "".join(np.random.choice(lts, np.random.randint(100, 700)))} for letter in lts]

# parameters setup:
threshold = 1024
buffer = []
buffer_len = 0
res_data = []

for el in data:
    # I assumed it's one key, one value per dictionary (looks like this from your question)
    len_ = sys.getsizeof(list(el.values())[0])
    if buffer_len + len_ > threshold:
        res_data.append(buffer)
        buffer = [el]
        buffer_len = len_
    else:
        buffer.append(el)
        buffer_len += len_

if buffer_len > 0:
    res_data.append(buffer)

print(res_data)
Upvotes: 1