carl

Reputation: 4436

.join() with a maximum string length in Python

I want to join a list of ids into a single string, with each id separated by an ' OR '. In Python I can do that with

' OR '.join(list_of_ids)

I am wondering whether there is a way to prevent this string from becoming too large (in terms of bytes). This matters because I use the string in an API call, and that API imposes a maximum length of 4094 bytes. My solution is below; I am just wondering whether there is a better one.

import sys

list_of_query_strings = []
substring = list_of_ids[0]
list_of_ids.pop(0)
while list_of_ids:
    new_addition = ' OR ' + list_of_ids[0]
    # Add the next id if the combined string still fits
    # (note: sys.getsizeof measures the whole str object, including interpreter overhead)
    if sys.getsizeof(substring + new_addition) < 4094:
        substring += new_addition
    # Otherwise save the current chunk and start a new one
    else:
        list_of_query_strings.append(substring)
        substring = list_of_ids[0]
    list_of_ids.pop(0)
list_of_query_strings.append(substring)
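
For illustration, here is the same loop in a self-contained form with made-up ids (the real ids come from the API), plus a quick check that every chunk fits and no id is lost:

import sys

# Made-up ids purely for illustration
list_of_ids = ['id-%05d' % n for n in range(2000)]
original_ids = list(list_of_ids)

list_of_query_strings = []
substring = list_of_ids[0]
list_of_ids.pop(0)
while list_of_ids:
    new_addition = ' OR ' + list_of_ids[0]
    if sys.getsizeof(substring + new_addition) < 4094:
        substring += new_addition
    else:
        list_of_query_strings.append(substring)
        substring = list_of_ids[0]
    list_of_ids.pop(0)
list_of_query_strings.append(substring)

# Every chunk stays under the limit, and joining the chunks back recovers all ids
assert all(sys.getsizeof(s) < 4094 for s in list_of_query_strings)
assert ' OR '.join(list_of_query_strings).split(' OR ') == original_ids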

Upvotes: 4

Views: 5152

Answers (2)

Remolten

Reputation: 2682

This is a simpler solution than your current one:

list_of_query_strings = []
one_string = list_of_ids[0]

# Iterate over each remaining id
for id_ in list_of_ids[1:]:
    # Add the id to the current string if it doesn't make it too large
    # (the + 4 accounts for the length of ' OR ')
    if len(one_string) + len(id_) + 4 < 4094:
        one_string += ' OR ' + id_
    # Current string is too large, so add it to the list and reset
    else:
        list_of_query_strings.append(one_string)
        one_string = id_

# Don't forget the last chunk
list_of_query_strings.append(one_string)
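
If it helps, here's the same idea wrapped in a small helper (chunk_query is just a name made up for this sketch) so it can be reused and tested:

def chunk_query(ids, max_len=4094, sep=' OR '):
    """Join ids with sep, splitting into chunks shorter than max_len characters."""
    chunks = []
    current = ids[0]
    for id_ in ids[1:]:
        # len(sep) plays the role of the + 4 above
        if len(current) + len(sep) + len(id_) < max_len:
            current += sep + id_
        else:
            chunks.append(current)
            current = id_
    chunks.append(current)
    return chunks

# Made-up ids for illustration
queries = chunk_query(['user-%04d' % n for n in range(3000)])
assert all(len(q) < 4094 for q in queries)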

Upvotes: 3

ShadowRanger

Reputation: 155744

Just for fun, an over-engineered solution (one that avoids Schlemiel the Painter-style repeated concatenation, allowing you to use str.join for efficient combining):

from itertools import count, groupby

class CumulativeLengthGrouper:
    def __init__(self, joiner, maxblocksize):
        self.joinerlen = len(joiner)
        self.maxblocksize = maxblocksize
        self.groupctr = count()
        self.curgrp = next(self.groupctr)
        # Start negative so the first element isn't charged for a joiner,
        # avoiding a per-call special case
        self.accumlen = -self.joinerlen

    def __call__(self, newstr):
        self.accumlen += self.joinerlen + len(newstr)
        # If accumulated length exceeds block limit...
        if self.accumlen > self.maxblocksize:
            # Move to new group
            self.curgrp = next(self.groupctr)
            self.accumlen = len(newstr)
        return self.curgrp

With this, you use itertools.groupby to break up your iterable into pre-sized groups, then join them without using repeated concatenation:

mystrings = [...]

myblocks = [' OR '.join(grp) for _, grp in
            groupby(mystrings, key=CumulativeLengthGrouper(' OR ', 4094))]

If the goal is to produce strings with a given byte size using a specified encoding, you could tweak the CumulativeLengthGrouper to accept a third constructor argument:

class CumulativeLengthGrouper:
    def __init__(self, joiner, maxblocksize, encoding='utf-8'):
        self.encoding = encoding
        self.joinerlen = len(joiner.encode(encoding))
        self.maxblocksize = maxblocksize
        self.groupctr = count()
        self.curgrp = next(self.groupctr)
        # Start negative so the first element isn't charged for a joiner,
        # avoiding a per-call special case
        self.accumlen = -self.joinerlen

    def __call__(self, newstr):
        newbytes = newstr.encode(self.encoding)
        self.accumlen += self.joinerlen + len(newbytes)
        # If accumulated length exceeds block limit...
        if self.accumlen > self.maxblocksize:
            # Move to new group
            self.curgrp = next(self.groupctr)
            self.accumlen = len(newbytes)
        return self.curgrp
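
Usage is the same as before; for example (with made-up strings, and assuming the 4094 limit is counted in UTF-8 bytes):

from itertools import groupby

mystrings = ['id-%d' % n for n in range(10000)]  # made-up ids for illustration

myblocks = [' OR '.join(grp) for _, grp in
            groupby(mystrings, key=CumulativeLengthGrouper(' OR ', 4094))]

# Each block fits within the byte budget
assert all(len(block.encode('utf-8')) <= 4094 for block in myblocks)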

Upvotes: 5
