Reputation: 51
I am trying to build a sturdy, serverless ETL flow on AWS to load CSV files from an S3 bucket into a Postgres instance (PostgreSQL 9.6). The files are rather big (10 GB+), so while testing I have run into multiple limitations of Lambda (/tmp space, memory, execution time).
I am looking into AWS Glue, but I have the feeling that it might have limitations when working with RDS, and that it is a bit of overkill for what I am trying to do.
Furthermore, the CSV file uses a multi-character delimiter ("~|~") that Postgres COPY FROM cannot process, so I need to do some preprocessing as well (roughly along the lines of the sketch below).
Does anyone have experience with a similar situation and managed to do it serverless, or should I stick with an EC2 instance for now?
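To illustrate the preprocessing I mean, here is a minimal sketch (an assumed approach, not something I have running): stream the object from S3, rewrite "~|~" to a single byte that COPY will accept, and pipe the result into COPY ... FROM STDIN with psycopg2. Bucket, key, table, and credentials are placeholders.

```python
import boto3
import psycopg2

DELIM_IN = "~|~"
DELIM_OUT = "\x1f"  # ASCII unit separator, unlikely to occur in the data


class TranslatingReader:
    """File-like wrapper that rewrites the delimiter as lines stream in."""

    def __init__(self, body):
        self._lines = body.iter_lines()  # streams, avoids loading 10 GB into memory

    def read(self, size=-1):
        try:
            line = next(self._lines)
        except StopIteration:
            return ""  # signals EOF to copy_expert
        return line.decode("utf-8").replace(DELIM_IN, DELIM_OUT) + "\n"


s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="incoming/big-file.csv")  # placeholders

conn = psycopg2.connect(host="my-rds-host", dbname="mydb",
                        user="etl_user", password="***")
with conn, conn.cursor() as cur:
    # COPY only takes a single-byte delimiter, hence the rewrite above.
    cur.copy_expert(
        "COPY public.target_table FROM STDIN WITH (FORMAT text, DELIMITER E'\\x1f')",
        TranslatingReader(obj["Body"]),
    )
conn.close()
```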
Upvotes: 0
Views: 145
Reputation: 1243
I've had issues with large files and Lambda as well; they just don't mix well, since you don't get enough control over the computing environment.
I had a sketch of a solution to something similar; basically it's two Lambdas:
Lambda_1's job is to inspect the S3 object, determine its size, and iteratively do ranged S3 GetObject calls, chunking with ByteRangeStart and ByteRangeEnd (see here); a rough sketch of the ranged reads follows below. This Lambda will need to deal with lines that get split across chunk boundaries, though. For each complete line it identifies for processing, it calls...
Lambda_2: all your logic and DB calls, but in a much nicer one-line-at-a-time mode.
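A rough sketch of Lambda_1 (not actual tested code): it looks up the object size, pulls the file down in byte-range chunks, stitches the partial line at each chunk boundary back together, and hands complete lines to Lambda_2. In boto3 the byte range is expressed as a Range header rather than ByteRangeStart/ByteRangeEnd; bucket, key, function name, and chunk size are placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per ranged GET


def handler(event, context):
    bucket, key = event["bucket"], event["key"]
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    leftover = b""
    for start in range(0, size, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, size) - 1
        body = s3.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
        )["Body"].read()

        lines = (leftover + body).split(b"\n")
        leftover = lines.pop()  # last element may be a partial line

        for line in lines:
            if not line:
                continue
            # One invocation per complete line (see the update below for batching).
            lam.invoke(
                FunctionName="lambda_2",     # placeholder name
                InvocationType="Event",      # async, fire-and-forget
                Payload=json.dumps({"line": line.decode("utf-8")}),
            )

    if leftover:  # final line without a trailing newline
        lam.invoke(
            FunctionName="lambda_2",
            InvocationType="Event",
            Payload=json.dumps({"line": leftover.decode("utf-8")}),
        )
```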
Update: for efficiency you might want to aggregate 'n' lines in Lambda_1 and call that 2nd Lambda with the 'n' lines as a parameter, to minimize Lambda time ($), and then do the DB inserts/updates in batches in Lambda_2 to avoid the per-statement overhead there as well.
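Correspondingly, a sketch of a batched Lambda_2: it takes a list of lines, splits each on the multi-character delimiter, and does one multi-row insert per invocation (psycopg2's execute_values handles the multi-row INSERT). Table, column names, and credentials are placeholders.

```python
import psycopg2
from psycopg2.extras import execute_values


def handler(event, context):
    # Split each incoming line on the "~|~" delimiter into column values.
    rows = [line.split("~|~") for line in event["lines"]]

    conn = psycopg2.connect(host="my-rds-host", dbname="mydb",
                            user="etl_user", password="***")
    with conn, conn.cursor() as cur:
        # One multi-row INSERT per batch instead of one statement per line.
        execute_values(
            cur,
            "INSERT INTO public.target_table (col_a, col_b, col_c) VALUES %s",
            rows,
        )
    conn.close()
```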
Upvotes: 1
Reputation: 334
AWS Glue is a good option for this kind of task, and it has good support for RDS via JDBC. The steps you can take:
Glue is able to generate the script on the fly if it's all standard transformation, so you might not need to write any code. It was a nice experience for me.
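If you do end up writing the job yourself, a rough hand-written sketch (not the auto-generated script) of a Glue PySpark job could look like the following: it reads the raw lines from S3, splits them on "~|~" manually (since the CSV reader wants a single-character separator), and writes the result to the RDS Postgres instance over JDBC. The bucket, path, column names, and connection details are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, split

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw lines; the "~|~" delimiter is handled by splitting ourselves.
lines = spark.read.text("s3://my-bucket/incoming/big-file.csv")  # placeholder path
parts = split(col("value"), r"~\|~")  # regex-escaped delimiter
df = lines.select(
    parts.getItem(0).alias("col_a"),  # placeholder column names
    parts.getItem(1).alias("col_b"),
    parts.getItem(2).alias("col_c"),
)

# Write to the RDS Postgres instance over JDBC (credentials are placeholders;
# in practice you'd pull them from a Glue connection or Secrets Manager).
df.write.jdbc(
    url="jdbc:postgresql://my-rds-host:5432/mydb",
    table="public.target_table",
    mode="append",
    properties={"user": "etl_user", "password": "***",
                "driver": "org.postgresql.Driver"},
)

job.commit()
```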
Upvotes: 2