sim

Reputation: 31

How to split a text file into chunks?

I have tried many methods, but none of them has worked for me. I want to split a text file's lines into multiple chunks, specifically 50 lines per chunk.

Like this: [['Line1', 'Line2', ... up to 50], ...] and so on.

Upvotes: 2

Views: 2215

Answers (3)

martineau

Reputation: 123413

A good way to do it would be to create a generic generator function that could break any sequence up into chunks of any size. Here's what I mean:

from itertools import zip_longest

def grouper(n, iterable):  # Generator function.
    "s -> (s0, s1, ...sn-1), (sn, sn+1, ...s2n-1), (s2n, s2n+1, ...s3n-1), ..."
    FILLER = object()  # Unique object
    for group in zip_longest(*([iter(iterable)]*n), fillvalue=FILLER):
        limit = group.index(FILLER) if group[-1] is FILLER else len(group)
        yield group[:limit]  # Sliced to remove any filler.


if __name__ == '__main__':
    from pprint import pprint

    with open('lorem ipsum.txt') as inf:
        for chunk in grouper(3, inf):
            pprint(chunk, width=90)

If the lorem ipsum.txt file contained these lines:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.
In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In
elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,
dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor
facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.
Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras
pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex
arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.

The result will be the following chunks, each composed of 3 lines or fewer:

('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.\n',
 'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In\n',
 'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,\n')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor\n',
 'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.\n',
 'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras\n')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex\n',
 'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.\n')

Update

If you want to remove the newline characters from the end of the lines of the file, you could do it with the same generic grouper() function by passing it a generator expression to preprocess the lines being read without needing to read them all into memory first:

if __name__ == '__main__':
    from pprint import pprint

    with open('lorem ipsum.txt') as inf:
        lines = (line.rstrip('\n') for line in inf)  # Generator expression (note the parentheses); strips only the newline.
        for chunk in grouper(3, lines):
            pprint(chunk, width=90)

Output using generator expression:

('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.',
 'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In',
 'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor',
 'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.',
 'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex',
 'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.')
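
As a possible alternative, the same chunking can be sketched without a filler value by pulling items with itertools.islice. The name chunked is hypothetical (it is not part of the answer above), and the Line1..Line8 input is made up for illustration; a file object could be passed instead of the list:

```python
from itertools import islice


def chunked(iterable, n):
    """Yield successive lists of at most n items from iterable."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))  # Take up to n items from the iterator.
        if not chunk:  # Iterator exhausted, nothing left to yield.
            return
        yield chunk


# Illustrative input; any iterable (including an open file) should work.
lines = [f"Line{i}" for i in range(1, 9)]
result = list(chunked(lines, 3))
print(result)
```

Unlike grouper(), this sketch yields lists rather than tuples, and the last chunk is simply shorter, so no sentinel needs to be sliced off.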

Upvotes: 3

Stijn B

Reputation: 360

data.txt (example):

Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8

Python code:

with open('data.txt', 'r') as file:
    sample = file.readlines()

chunks = []
for i in range(0, len(sample), 3): # replace 3 with 50 in your case
    chunks.append(sample[i:i+3])   # replace 3 with 50 in your case

chunks (in my example, chunks of 3 lines):

[['Line1\n', 'Line2\n', 'Line3\n'], ['Line4\n', 'Line5\n', 'Line6\n'], ['Line7\n', 'Line8']]

You can apply the str.rstrip('\n') method to each of those lines to remove the \n at the end.
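
A minimal sketch of that stripping step, assuming the chunks list has the shape shown above (the name stripped is only for illustration):

```python
# Chunks as produced by the readlines() example above.
chunks = [['Line1\n', 'Line2\n', 'Line3\n'],
          ['Line4\n', 'Line5\n', 'Line6\n'],
          ['Line7\n', 'Line8']]

# Remove the trailing newline from every line in every chunk.
stripped = [[line.rstrip('\n') for line in chunk] for chunk in chunks]
print(stripped)
```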

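Alternative:
Reading line by line instead of loading the whole file with readlines() (better). Note that the chunks list still ends up holding every line, so handle each chunk inside the loop if memory is a concern: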

chunks = []

with open('data.txt', 'r') as file:
    while True:
        chunk = []
        for i in range(3): # replace 3 with 50 in your case
            line = file.readline()
            if not line:
                break
            chunk.append(line)
            # or chunk.append(line.rstrip('\n')) to remove the '\n' at the ends
        if not chunk:
            break
        chunks.append(chunk)

print(chunks)

This produces the same result.
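
If the chunks do not all need to be kept at once, the loop above could be turned into a generator so only one chunk is in memory at a time. This is a rough sketch under that assumption; iter_chunks is a hypothetical helper name, and the temporary file only stands in for data.txt so the example runs on its own:

```python
import os
import tempfile


def iter_chunks(path, size):
    """Yield lists of up to `size` lines (newlines stripped) from the file at `path`."""
    with open(path, 'r') as file:
        chunk = []
        for line in file:
            chunk.append(line.rstrip('\n'))
            if len(chunk) == size:  # Chunk is full: hand it out and start a new one.
                yield chunk
                chunk = []
        if chunk:  # Leftover lines form a final, shorter chunk.
            yield chunk


# Demo with a throwaway file standing in for data.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(f'Line{i}' for i in range(1, 9)))
    demo_path = tmp.name

demo_chunks = list(iter_chunks(demo_path, 3))  # replace 3 with 50 in your case
os.remove(demo_path)
print(demo_chunks)
```

In real use you would iterate over iter_chunks(...) directly and process each chunk in the loop body rather than collecting them with list().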

Upvotes: 3

Red

Reputation: 27547

You can split the text by each newline using the str.splitlines() method. Then, using a list comprehension, you can use list slices to slice the list at increments of the chunk_size (50 in your case). Below, I used 3 as the chunk_size variable, but you can replace that with 50:

text = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.'''

lines = text.splitlines()
chunk_size = 3
chunks = [lines[i: i + chunk_size] for i in range(0, len(lines), chunk_size)]
print(chunks)

Output:

[['Lorem ipsum dolor sit amet, consectetur adipiscing elit,', 'sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', 'Ut enim ad minim veniam,'], 
 ['quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.', 'Duis aute irure dolor in reprehenderit in voluptate', 'velit esse cillum dolore eu fugiat nulla pariatur.'], 
 ['Excepteur sint occaecat cupidatat non proident,', 'sunt in culpa qui officia deserunt mollit anim id est laborum.']]
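
To see the same slicing with the chunk size from the question, here is a small sketch; the 120-line sample text is generated purely for illustration, and chunk_lengths is just a helper name for checking the result:

```python
# Made-up sample: 120 lines named Line1 .. Line120.
text = '\n'.join(f'Line{i}' for i in range(1, 121))

lines = text.splitlines()
chunk_size = 50
chunks = [lines[i: i + chunk_size] for i in range(0, len(lines), chunk_size)]

# Two full chunks of 50 lines and one final chunk with the remaining 20.
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)
```

For a file, the text could come from reading its contents into a string first, which does load the whole file into memory.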

Upvotes: 1
