Reputation: 31182
I have a string in Python containing the contents of a large text file (over 1 MiB). I need to split it into chunks.
Constraints:
Each chunk must be at most LIMIT characters long and must end at a line boundary. Lines longer than LIMIT can be omitted.
Any idea how to implement this nicely in Python?
Thank you in advance.
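For illustration, here is the behaviour I am after (the LIMIT value and the sample text are made up):

    text = "aaa\nbb\ncccc\nthis line is far longer than the limit\ndd\n"
    # With LIMIT = 10 I would expect two chunks:
    #   "aaa\nbb\n"   (7 characters; adding "cccc\n" would exceed the limit)
    #   "cccc\ndd\n"  (8 characters)
    # The over-long line in the middle is simply omitted.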
Upvotes: 1
Views: 1690
Reputation: 31182
Here is my not-so-pythonic solution:
def line_chunks(lines, chunk_limit):
    """Group consecutive lines into chunks whose combined length stays below chunk_limit."""
    chunks = []
    chunk = []
    chunk_len = 0
    for line in lines:
        if len(line) + chunk_len < chunk_limit:
            # The line still fits into the current chunk.
            chunk.append(line)
            chunk_len += len(line)
        else:
            # The line does not fit: close the current chunk and start a new one.
            chunks.append(chunk)
            chunk = [line]
            chunk_len = len(line)
    chunks.append(chunk)
    return chunks

chunks = line_chunks(data.split('\n'), 150)
print('\n---new-chunk---\n'.join('\n'.join(chunk) for chunk in chunks))
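A slightly more Pythonic take on the same idea is a generator, so the chunks are produced lazily instead of being collected into one list first (just a sketch; line_chunks_iter is a made-up name):

    def line_chunks_iter(lines, chunk_limit):
        # Generator variant of line_chunks: yields one chunk at a time.
        chunk = []
        chunk_len = 0
        for line in lines:
            if chunk and chunk_len + len(line) >= chunk_limit:
                # Adding this line would reach the limit: emit the current chunk.
                yield chunk
                chunk = []
                chunk_len = 0
            chunk.append(line)
            chunk_len += len(line)
        if chunk:
            yield chunk

    for chunk in line_chunks_iter(data.split('\n'), 150):
        print('\n'.join(chunk))

Note that, unlike the list version, this never yields an empty chunk when a single line exceeds the limit.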
Upvotes: 1
Reputation: 1196
Following the suggestion of Linuxios, you could use rfind to find the last newline within the limit and split at that point. If no newline character is found, the chunk is too large and can be dismissed.
chunks = []
not_chunked_text = input_text
while not_chunked_text:
    if len(not_chunked_text) <= LIMIT:
        # The remaining text fits into a single chunk.
        chunks.append(not_chunked_text)
        break
    # Find the last newline within the first LIMIT characters.
    split_index = not_chunked_text.rfind("\n", 0, LIMIT)
    if split_index == -1:
        # The chunk is too big, so everything until the next newline is deleted
        try:
            not_chunked_text = not_chunked_text.split("\n", 1)[1]
        except IndexError:
            # No "\n" in not_chunked_text, i.e. the end of the input text was reached
            break
    else:
        chunks.append(not_chunked_text[:split_index + 1])
        not_chunked_text = not_chunked_text[split_index + 1:]
rfind("\n", 0, LIMIT) returns the highest index at which a newline character is found within the bounds of your LIMIT. The slice not_chunked_text[:split_index+1] is needed so that the newline character is included in the chunk.
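A quick demonstration of this with a made-up sample string:

    text = "one\ntwo\nthree\n"
    print(text.rfind("\n", 0, 10))   # 7: the last newline before index 10 is the one after "two"
    print(repr(text[:7 + 1]))        # 'one\ntwo\n' -- the trailing newline is kept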
I interpreted LIMIT as the biggest chunk length that is allowed. If a chunk of exactly LIMIT characters should not be allowed, you have to add a -1 after every LIMIT in this code.
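For completeness, here is the same loop wrapped in a function, together with a tiny usage example (the name split_into_chunks and the sample data are mine, not part of the answer above):

    def split_into_chunks(input_text, limit):
        # Same algorithm as the loop above, packaged for reuse.
        chunks = []
        not_chunked_text = input_text
        while not_chunked_text:
            if len(not_chunked_text) <= limit:
                chunks.append(not_chunked_text)
                break
            split_index = not_chunked_text.rfind("\n", 0, limit)
            if split_index == -1:
                # Over-long line: drop everything up to and including the next newline.
                try:
                    not_chunked_text = not_chunked_text.split("\n", 1)[1]
                except IndexError:
                    break
            else:
                chunks.append(not_chunked_text[:split_index + 1])
                not_chunked_text = not_chunked_text[split_index + 1:]
        return chunks

    print(split_into_chunks("aaa\nbb\nthis line is longer than the limit\ncc\n", 10))
    # -> ['aaa\nbb\n', 'cc\n']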
Upvotes: 2