Reputation: 51
I am trying to build a sturdy, serverless ETL flow on AWS to load CSV files from an S3 bucket into a Postgres instance (PostgreSQL 9.6). The files are rather big (10 GB+), so while testing I have run into multiple limitations of Lambda (/tmp space, memory, execution time).
I am looking into AWS Glue, but I have the feeling that it might have limitations when working with RDS, and that it is a bit of overkill for what I am trying to do.
Furthermore, the CSV file uses a multi-character delimiter ("~|~") that Postgres COPY FROM cannot process, so I need to do some preprocessing as well (roughly along the lines of the sketch below).
Does anyone have experience with a similar situation and managed to do it serverless, or should I stick with an EC2 instance for now?
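To illustrate the preprocessing I mean, here is a minimal sketch (an assumed approach, not something I have running): stream the object from S3, rewrite "~|~" to a single byte that COPY will accept, and pipe the result into COPY ... FROM STDIN with psycopg2. Bucket, key, table, and credentials are placeholders.

```python
import boto3
import psycopg2

DELIM_IN = "~|~"
DELIM_OUT = "\x1f"  # ASCII unit separator, unlikely to occur in the data


class TranslatingReader:
    """File-like wrapper that rewrites the delimiter as lines stream in."""

    def __init__(self, body):
        self._lines = body.iter_lines()  # streams, avoids loading 10 GB into memory

    def read(self, size=-1):
        try:
            line = next(self._lines)
        except StopIteration:
            return ""  # signals EOF to copy_expert
        return line.decode("utf-8").replace(DELIM_IN, DELIM_OUT) + "\n"


s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="incoming/big-file.csv")  # placeholders

conn = psycopg2.connect(host="my-rds-host", dbname="mydb",
                        user="etl_user", password="***")
with conn, conn.cursor() as cur:
    # COPY only takes a single-byte delimiter, hence the rewrite above.
    cur.copy_expert(
        "COPY public.target_table FROM STDIN WITH (FORMAT text, DELIMITER E'\\x1f')",
        TranslatingReader(obj["Body"]),
    )
conn.close()
```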
Upvotes: 0
Views: 145
Reputation: 1243
I've had issues with large files and Lambda as well; they just don't mix well, since you don't get enough control over the computing environment.
I had a sketch of a solution to something similar; basically it's two Lambdas:
Lambda_1's job is to inspect the S3 object, determine its size, and iteratively do ranged S3 GetObject calls, chunking with ByteRangeStart and ByteRangeEnd (see here); a rough sketch of the ranged reads follows below. This Lambda will need to deal with lines that get split across chunk boundaries, though. For each complete line it identifies for processing, it calls...
Lambda_2: all your logic and DB calls, but in a much nicer one-line-at-a-time mode.
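A rough sketch of Lambda_1 (not actual tested code): it looks up the object size, pulls the file down in byte-range chunks, stitches the partial line at each chunk boundary back together, and hands complete lines to Lambda_2. In boto3 the byte range is expressed as a Range header rather than ByteRangeStart/ByteRangeEnd; bucket, key, function name, and chunk size are placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB per ranged GET


def handler(event, context):
    bucket, key = event["bucket"], event["key"]
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    leftover = b""
    for start in range(0, size, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, size) - 1
        body = s3.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
        )["Body"].read()

        lines = (leftover + body).split(b"\n")
        leftover = lines.pop()  # last element may be a partial line

        for line in lines:
            if not line:
                continue
            # One invocation per complete line (see the update below for batching).
            lam.invoke(
                FunctionName="lambda_2",     # placeholder name
                InvocationType="Event",      # async, fire-and-forget
                Payload=json.dumps({"line": line.decode("utf-8")}),
            )

    if leftover:  # final line without a trailing newline
        lam.invoke(
            FunctionName="lambda_2",
            InvocationType="Event",
            Payload=json.dumps({"line": leftover.decode("utf-8")}),
        )
```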
Update: for efficiency you might want to aggregate 'n' lines in Lambda_1 and call that 2nd Lambda with the 'n' lines as a parameter, to minimize Lambda time ($), and then do the DB inserts/updates in batches in Lambda_2 to avoid the per-statement overhead there as well.
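Correspondingly, a sketch of a batched Lambda_2: it takes a list of lines, splits each on the multi-character delimiter, and does one multi-row insert per invocation (psycopg2's execute_values handles the multi-row INSERT). Table, column names, and credentials are placeholders.

```python
import psycopg2
from psycopg2.extras import execute_values


def handler(event, context):
    # Split each incoming line on the "~|~" delimiter into column values.
    rows = [line.split("~|~") for line in event["lines"]]

    conn = psycopg2.connect(host="my-rds-host", dbname="mydb",
                            user="etl_user", password="***")
    with conn, conn.cursor() as cur:
        # One multi-row INSERT per batch instead of one statement per line.
        execute_values(
            cur,
            "INSERT INTO public.target_table (col_a, col_b, col_c) VALUES %s",
            rows,
        )
    conn.close()
```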
Upvotes: 1
Reputation: 334
AWS Glue is a good option for this kind of task, and it has good support for RDS via JDBC. The steps you can take:
Glue is able to generate the script on the fly if it's all standard transformation, so you might not need to write any code. It was a nice experience for me.
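If you do end up writing the job yourself, a rough hand-written sketch (not the auto-generated script) of a Glue PySpark job could look like the following: it reads the raw lines from S3, splits them on "~|~" manually (since the CSV reader wants a single-character separator), and writes the result to the RDS Postgres instance over JDBC. The bucket, path, column names, and connection details are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, split

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw lines; the "~|~" delimiter is handled by splitting ourselves.
lines = spark.read.text("s3://my-bucket/incoming/big-file.csv")  # placeholder path
parts = split(col("value"), r"~\|~")  # regex-escaped delimiter
df = lines.select(
    parts.getItem(0).alias("col_a"),  # placeholder column names
    parts.getItem(1).alias("col_b"),
    parts.getItem(2).alias("col_c"),
)

# Write to the RDS Postgres instance over JDBC (credentials are placeholders;
# in practice you'd pull them from a Glue connection or Secrets Manager).
df.write.jdbc(
    url="jdbc:postgresql://my-rds-host:5432/mydb",
    table="public.target_table",
    mode="append",
    properties={"user": "etl_user", "password": "***",
                "driver": "org.postgresql.Driver"},
)

job.commit()
```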
Upvotes: 2