Reputation: 422
I have an 8 GB file with text lines (each line ends with a carriage return) in S3. This file is custom formatted and does NOT follow any common format like CSV, pipe, JSON ... I need to split that file into smaller files based on the number of lines, such that each file will contain 100,000 lines or less (the last file can hold the remainder of the lines and thus may have fewer than 100,000 lines).
So far I have found a lot of posts showing how to split by byte size but not by number of lines. Also, I do not want to read that file line by line, as that would be too slow and inefficient.
Could someone show me starter code or a method that could split this 8 GB file quickly without requiring more than 10 GB of available memory (RAM) at any point?
I am looking for all possible options, as long as the basic requirements above are met...
BIG thank you!
Michael
Upvotes: 1
Views: 2141
Reputation: 3387
The boto3.S3.Client.get_object() method returns an object of type StreamingBody in its response. The StreamingBody.iter_lines() method documentation states:
Return an iterator to yield lines from the raw stream.
This is achieved by reading chunk of bytes (of size chunk_size) at a time from the raw stream, and then yielding lines from there.
This might suit your use case. The general idea is to stream that huge file and process its contents as they arrive. I cannot think of a way to do this without reading the file in some way.
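For illustration, here is a minimal sketch of that approach: stream the object with get_object(), iterate over its lines with iter_lines(), and roll over to a new local output file every 100,000 lines. The bucket name, key, and output file naming are placeholders, not from your question.

```python
import boto3

# Placeholder values -- replace with your own bucket, key, and chunk size.
BUCKET = "my-bucket"
KEY = "big-custom-file.txt"
LINES_PER_FILE = 100_000

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"]  # StreamingBody

part = 0        # index of the current output file
count = 0       # lines written to the current output file
out = None

# iter_lines() reads the stream in chunks and yields one line (bytes,
# newline stripped) at a time, so memory use stays bounded.
for line in body.iter_lines():
    if out is None:
        out = open(f"part-{part:05d}.txt", "wb")
    out.write(line + b"\n")
    count += 1
    if count == LINES_PER_FILE:
        out.close()
        out = None
        count = 0
        part += 1

if out is not None:
    out.close()
```

Only one chunk of the stream is held in memory at a time, so this stays well under the 10 GB limit; the trade-off is that the whole 8 GB still has to be downloaded once, line boundaries being unknowable otherwise.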
Upvotes: 2