bda

Reputation: 422

Splitting a Large S3 File into Lines per File (not bytes per file)

I have an 8 GB text file in S3 (each line ends with a newline). The file is custom formatted and does NOT follow any common format like CSV, pipe, or JSON. I need to split it into smaller files based on the number of lines, such that each file contains 100,000 lines or less (the last file gets the remainder and so may have fewer than 100,000 lines).

  1. I need a method that is not based on the file size (i.e. bytes), but on the number of lines. A single line can't be split across two files.
  2. I need to use Python.
  3. I need to use server-less AWS service like Lambda, Glue ... I can't spin up instances like EC2 or EMR.

So far I have found a lot of posts showing how to split by byte size but not by number of lines. Also, I do not want to read the file line by line, as that would be too slow and inefficient.

Could someone show me starter code or a method that could split this 8 GB file quickly, without ever requiring more than 10 GB of available memory (RAM)?

I am looking for all possible options, as long as the basic requirements above are met...

BIG thank you!

Michael

Upvotes: 1

Views: 2141

Answers (1)

alexis-donoghue

Reputation: 3387

The boto3 S3.Client.get_object() method returns a response whose Body is a StreamingBody object.

StreamingBody.iter_lines() method documentation states:

Return an iterator to yield lines from the raw stream.

This is achieved by reading chunk of bytes (of size chunk_size) at a time from the raw stream, and then yielding lines from there.

This might suit your use case. The general idea is to stream the huge file and process its contents as they arrive, so only a small buffer is in memory at any time. There is no way to split on line boundaries without reading the file in some form, but streaming avoids loading all 8 GB at once.
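A minimal sketch of that idea, assuming a hypothetical bucket and key of your choosing (the splitting logic is factored into a pure generator so the chunk size and boundaries are easy to verify; `iter_lines()` yields lines as bytes without their trailing newlines, so the newlines are re-added on write):

```python
from itertools import islice


def iter_chunks(lines, chunk_size=100_000):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def split_s3_file(bucket, key, chunk_size=100_000):
    """Stream an S3 object line by line and write 100,000-line parts back to S3.

    `bucket` and `key` are placeholders; the part naming scheme is an
    assumption, not a requirement of the question.
    """
    import boto3  # imported here so the pure helper above has no AWS dependency

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]

    for i, chunk in enumerate(iter_chunks(body.iter_lines(), chunk_size)):
        part_key = f"{key}.part{i:05d}"
        # iter_lines() strips newlines, so join the lines back with b"\n"
        s3.put_object(Bucket=bucket, Key=part_key,
                      Body=b"\n".join(chunk) + b"\n")
```

Memory stays bounded because only one 100,000-line chunk is held at a time; a Lambda with a few GB of memory should be comfortable, though for an 8 GB object you would need to watch the 15-minute Lambda timeout and might prefer a Glue Python shell job for headroom.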

Upvotes: 2
