Reputation: 31
I have tried many methods but it hasn't worked for me. I want to split a text files lines into multiple chunks. Specifically 50 lines per chunk.
Like this [['Line1', 'Line2' -- up to 50]
and so on.
Upvotes: 2
Views: 2215
Reputation: 123413
A good way to do it would be to create a generic generator function that could break any sequence up into chunks of any size. Here's what I mean:
from itertools import zip_longest
def grouper(n, iterable): # Generator function.
"s -> (s0, s1, ...sn-1), (sn, sn+1, ...s2n-1), (s2n, s2n+1, ...s3n-1), ..."
FILLER = object() # Unique object
for group in zip_longest(*([iter(iterable)]*n), fillvalue=FILLER):
limit = group.index(FILLER) if group[-1] is FILLER else len(group)
yield group[:limit] # Sliced to remove any filler.
if __name__ == '__main__':
from pprint import pprint
with open('lorem ipsum.txt') as inf:
for chunk in grouper(3, inf):
pprint(chunk, width=90)
If the lorem ipsum.txt
file contained these lines:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.
In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In
elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,
dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor
facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.
Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras
pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex
arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.
The result will be the following chunks each composed of 3 lines or less:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.\n',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In\n',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,\n')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor\n',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.\n',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras\n')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex\n',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.\n')
If you want to remove the newline characters from the end of the lines of the file, you could do it with the same generic grouper()
function by passing it a generator expression to preprocess the lines being read without needing to read them all into memory first:
if __name__ == '__main__':
from pprint import pprint
with open('lorem ipsum.txt') as inf:
lines = (line.rstrip() for line in inf) # Generator expr - cuz outer parentheses.
for chunk in grouper(3, lines):
pprint(chunk, width=90)
Output using generator expression:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.')
Upvotes: 3
Reputation: 360
data.txt (example):
Line2
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Python code:
with open('data.txt', 'r') as file:
sample = file.readlines()
chunks = []
for i in range(0, len(sample), 3): # replace 3 with 50 in your case
chunks.append(sample[i:i+3]) # replace 3 with 50 in your case
chunks
(in my example, chunks of 3 lines):
[['Line1\n', 'Line2\n', 'Line3\n'], ['Line4\n', 'Line5\n', 'Line6\n'], ['Line7\n', 'Line8']]
You can apply the string.rstrip('\n')
method on those lines to remove the \n
at the end.
Alternative:
Without reading the whole file in memory (better):
chunks = []
with open('data.txt', 'r') as file:
while True:
chunk = []
for i in range(3): # replace 3 with 50 in your case
line = file.readline()
if not line:
break
chunk.append(line)
# or 'chunk.append(line.rstrip('\n')) to remove the '\n' at the ends
if not chunk:
break
chunks.append(chunk)
print(chunks)
Produces same result
Upvotes: 3
Reputation: 27547
You can split the text by each newline using the str.splitlines()
method. Then, using a list comprehension, you can use list slices to slice the list at increments of the chunk_size (50
in your case). Below, I used 3
as the chunk_size
variable, but you can replace that with 50
:
text = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.'''
lines = text.splitlines()
chunk_size = 3
chunks = [lines[i: i + chunk_size] for i in range(0, len(lines), chunk_size)]
print(chunks)
Output:
[['Lorem ipsum dolor sit amet, consectetur adipiscing elit,', 'sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', 'Ut enim ad minim veniam,'],
['quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.', 'Duis aute irure dolor in reprehenderit in voluptate', 'velit esse cillum dolore eu fugiat nulla pariatur.'],
['Excepteur sint occaecat cupidatat non proident,', 'sunt in culpa qui officia deserunt mollit anim id est laborum.']]
Upvotes: 1