bda

Reputation: 422

Splitting a Large S3 File into Lines per File (not bytes per file)

I have an 8 GB text file in S3 (each line ends with a newline). The file is custom formatted and does NOT follow any common format like CSV, pipe, or JSON. I need to split it into smaller files based on the number of lines, such that each file contains 100,000 lines or less (the last file gets the remainder and so may have fewer than 100,000 lines).

  1. I need a method that is not based on the file size (i.e. bytes), but on the number of lines. A single line can't be split across two files.
  2. I need to use Python.
  3. I need to use server-less AWS service like Lambda, Glue ... I can't spin up instances like EC2 or EMR.

So far I have found a lot of posts showing how to split by byte size but not by number of lines. Also, I do not want to read the file line by line, as that would be too slow and inefficient.

Could someone show me starter code or a method that could split this 8 GB file quickly, without ever requiring more than 10 GB of available memory (RAM)?

I am looking for all possible options, as long as the basic requirements above are met...

BIG thank you!

Michael

Upvotes: 1

Views: 2141

Answers (1)

alexis-donoghue

Reputation: 3387

The boto3 S3.Client.get_object() method returns a response whose Body is a StreamingBody object.

StreamingBody.iter_lines() method documentation states:

Return an iterator to yield lines from the raw stream.

This is achieved by reading chunk of bytes (of size chunk_size) at a time from the raw stream, and then yielding lines from there.

This might suit your use case. The general idea is to stream the huge file and process its contents as they arrive, so only a small buffer is in memory at any time. There is no way to split on line boundaries without reading the file in some form, but streaming avoids loading all 8 GB at once.
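A minimal sketch of that idea, assuming a hypothetical bucket and key of your choosing (the splitting logic is factored into a pure generator so the chunk size and boundaries are easy to verify; `iter_lines()` yields lines as bytes without their trailing newlines, so the newlines are re-added on write):

```python
from itertools import islice


def iter_chunks(lines, chunk_size=100_000):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def split_s3_file(bucket, key, chunk_size=100_000):
    """Stream an S3 object line by line and write 100,000-line parts back to S3.

    `bucket` and `key` are placeholders; the part naming scheme is an
    assumption, not a requirement of the question.
    """
    import boto3  # imported here so the pure helper above has no AWS dependency

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]

    for i, chunk in enumerate(iter_chunks(body.iter_lines(), chunk_size)):
        part_key = f"{key}.part{i:05d}"
        # iter_lines() strips newlines, so join the lines back with b"\n"
        s3.put_object(Bucket=bucket, Key=part_key,
                      Body=b"\n".join(chunk) + b"\n")
```

Memory stays bounded because only one 100,000-line chunk is held at a time; a Lambda with a few GB of memory should be comfortable, though for an 8 GB object you would need to watch the 15-minute Lambda timeout and might prefer a Glue Python shell job for headroom.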

Upvotes: 2
