python split newline chunks

Michał Šrajer

Reputation: 31182

Limited text chunks split by newline

I have a string in Python containing a large text file (over 1 MiB). I need to split it into chunks.

Constraints:

chunks can be split only at newline characters, and
len(chunk) must be as big as possible but smaller than LIMIT (e.g. 100 KiB)

Lines longer than LIMIT can be omitted.

Any idea how to implement this nicely in Python?

Thank you in advance.

Upvotes: 1

Answers (2)

Michał Šrajer

Reputation: 31182

Here is my not-so-pythonic solution:

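def line_chunks(lines, chunk_limit):
    """Group lines into chunks whose joined length stays below chunk_limit."""
    chunks = []
    chunk = []
    chunk_len = 0  # length of '\n'.join(chunk)
    for line in lines:
        if len(line) >= chunk_limit:
            # A line that cannot fit into any chunk is omitted.
            continue
        # Joining adds one '\n' between lines, so count it for non-empty chunks.
        added = len(line) + (1 if chunk else 0)
        if chunk_len + added < chunk_limit:
            chunk.append(line)
            chunk_len += added
        else:
            chunks.append(chunk)
            chunk = [line]
            chunk_len = len(line)
    if chunk:
        chunks.append(chunk)
    return chunks

chunks = line_chunks(data.split('\n'), 150)
print('\n---new-chunk---\n'.join('\n'.join(chunk) for chunk in chunks))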
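A generator variant of the same greedy idea may be easier to use when the chunks are consumed one at a time. This is only a sketch: the name iter_chunks and the choice to keep each line's trailing newline inside the chunk (so that len(chunk) is the real size of the text) are my assumptions, not part of the question.

```python
def iter_chunks(text, limit):
    """Yield chunks made of whole lines, each with len(chunk) < limit."""
    # Split on '\n' only and re-attach the newline to every line but the last.
    pieces = text.split("\n")
    lines = [piece + "\n" for piece in pieces[:-1]]
    if pieces[-1]:
        lines.append(pieces[-1])  # final line without a trailing newline

    parts = []  # lines collected for the current chunk
    size = 0    # total length of the lines in parts
    for line in lines:
        if len(line) >= limit:
            continue  # a line that can never fit is omitted
        if size + len(line) >= limit:
            # The current chunk is full: emit it and start a new one.
            yield "".join(parts)
            parts, size = [], 0
        parts.append(line)
        size += len(line)
    if parts:
        yield "".join(parts)


for chunk in iter_chunks("aa\nbb\ncc\n", 7):
    print(repr(chunk))
```

Note that an omitted line simply disappears, so the lines before and after it may end up in the same chunk.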

Upvotes: 1

BurningKarl

Reputation: 1196

Following the suggestion of Linuxios, you could use rfind to find the last newline within the limit and split at that point. If no newline character is found, the first line is longer than the limit and can be discarded.

chunks = []

not_chunked_text = input_text

while not_chunked_text:
    if len(not_chunked_text) <= LIMIT:
        chunks.append(not_chunked_text)
        break
    split_index = not_chunked_text.rfind("\n", 0, LIMIT)
    if split_index == -1:
        # The first line is longer than LIMIT, so everything up to the next newline is deleted
        try:
            not_chunked_text = not_chunked_text.split("\n", 1)[1]
        except IndexError:
            # No "\n" in not_chunked_text, i.e. the end of the input text was reached
            break
    else:
        chunks.append(not_chunked_text[:split_index+1])
        not_chunked_text = not_chunked_text[split_index+1:]

rfind("\n", 0, LIMIT) returns the highest index of a newline character among the first LIMIT characters, or -1 if there is none.
not_chunked_text[:split_index+1] is needed so that the newline character is included in the chunk.
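To make the rfind bounds concrete, here is a tiny check on a made-up sample string (sample and the variable names are purely illustrative):

```python
sample = "ab\ncd\nef"

within_4 = sample.rfind("\n", 0, 4)  # searches "ab\nc"
within_6 = sample.rfind("\n", 0, 6)  # searches "ab\ncd\n"
within_2 = sample.rfind("\n", 0, 2)  # searches "ab", which has no newline

print(within_4, within_6, within_2)  # 2 5 -1
print(repr(sample[:within_6 + 1]))   # 'ab\ncd\n'
```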

I interpreted LIMIT as the largest chunk length that is allowed. If a chunk of length exactly LIMIT should not be allowed, replace every LIMIT in this code with LIMIT - 1.
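For a text of several MiB, re-slicing the remaining string on every iteration copies a lot of data. A possible variant, sketched here as a function, keeps a start offset instead and passes it to rfind and find. The name chunk_by_rfind and its parameters are my own choice for illustration; the logic is meant to mirror the loop above, with limit again being the largest allowed chunk length.

```python
def chunk_by_rfind(text, limit):
    """Split text at newlines into chunks of at most `limit` characters."""
    chunks = []
    start = 0  # offset of the first character not yet chunked
    while start < len(text):
        if len(text) - start <= limit:
            # The rest fits into a single chunk.
            chunks.append(text[start:])
            break
        split_index = text.rfind("\n", start, start + limit)
        if split_index == -1:
            # The current line is too long: skip past its newline.
            next_newline = text.find("\n", start)
            if next_newline == -1:
                break  # the oversized line is the last one
            start = next_newline + 1
        else:
            # Include the newline itself in the chunk.
            chunks.append(text[start:split_index + 1])
            start = split_index + 1
    return chunks


print(chunk_by_rfind("aa\nbb\ncc\n", 6))
```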

Upvotes: 2
