ManojP

Reputation: 6248

Reading a CSV file from S3 using Spark

I am new to Spark. I have a scenario where I need to read and process a CSV file from AWS S3. The file is generated on a daily basis, so I need to read it, process it, and dump the data into Postgres.

I want to process this huge file in parallel to save time and memory.

I came up with two designs, but I am a little confused about Spark, since the Spark context requires a connection to be open to the S3 bucket.

  1. Use Spark to read the CSV from S3, process it, convert it into JSON row by row, and append the JSON data to a JSONB column in Postgres (roughly sketched below).
  2. Use Spring and Java: download the file to the server, then process it and convert it into JSON.
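Here is a minimal sketch of what I have in mind for option 1, assuming a plain batch read rather than streaming; the bucket, paths, table name, and credentials are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

val spark = SparkSession.builder()
  .appName("daily-csv-to-postgres")
  .getOrCreate()

// Batch read of the daily file; s3a:// needs the hadoop-aws jars on the classpath.
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/exports/daily.csv")

// Pack every column of a row into one JSON string, destined for the JSONB column.
val jsonDf = df.select(to_json(struct(df.columns.map(df(_)): _*)).alias("payload"))

// Append to Postgres over JDBC.
jsonDf.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "daily_payload")
  .option("user", "postgres")
  .option("password", "secret")
  .mode("append")
  .save()
```

I'm not sure whether the string written over JDBC goes into a JSONB column directly or needs a cast on the Postgres side.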

Can anyone point me in the right direction?

Upvotes: 2

Views: 1614

Answers (1)

stevel

Reputation: 13430

If it's daily, and only 100MB, you don't really need much in the way of large-scale tooling. I'd estimate under a minute for the basic download and processing, even done remotely, after which comes the Postgres load, for which Postgres has its own bulk-load support (COPY).

Try doing this locally: use aws s3 cp to copy the file to your local system, then try the load into Postgres.

I wouldn't bother with any parallel tooling; even Spark is going to want to work with 32-64MB blocks, so you won't get more than 2-3 workers. And if the file is .gz, you get exactly one.

That said, if you want to learn Spark, you could do this in spark-shell. Download the file locally first, though, just to save time and money.
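Something along these lines in spark-shell (the local path and header option are placeholders; spark is the session the shell creates for you):

```scala
// Read the already-downloaded daily file from the local filesystem.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/daily-export.csv")

// Quick sanity checks before worrying about the Postgres load.
df.printSchema()
df.count()
```

From there you can either write the DataFrame out over JDBC or dump it and let Postgres COPY do the load.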

Upvotes: 1
