Reputation: 6248
I am new to Spark. I have a scenario where I need to read and process a CSV file from AWS S3. This file is generated on a daily basis, so I need to read and process it and dump the data into Postgres.
I want to process this huge file in parallel to save time and memory.
I came up with two designs, but I am a little confused about Spark, since the Spark context requires a connection to be open to the S3 bucket.
Can anyone point me in the right direction?
Upvotes: 2
Views: 1614
Reputation: 13430
If it's daily, and only 100MB, you don't really need much in the way of large-scale tooling. I'd estimate under a minute for the basic download and processing, even remotely, after which comes the Postgres load, which Postgres has its own bulk-load support for (COPY).
Try doing this locally: use aws s3 cp to copy the file to your local system, then try the load with Postgres.
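A rough sketch of that local route, if it helps; the bucket, file, database, and table names below are made-up placeholders:

    # pull the day's file down from S3
    aws s3 cp s3://my-bucket/daily.csv ./daily.csv

    # bulk-load it into Postgres; \copy streams the file from the client side
    psql -d mydb -c "\copy daily_data FROM 'daily.csv' WITH (FORMAT csv, HEADER true)"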
I wouldn't bother with any parallel tooling; even Spark is going to want to work with 32-64MB blocks, so you won't get more than 2-3 workers. And if the file is .gz, you get exactly one, since gzip isn't splittable.
That said, if you want to learn Spark, you could do this in spark-shell. Download the file locally first, though, just to save time and money.
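If you go that route, a minimal spark-shell sketch, assuming the Postgres JDBC driver is on the classpath; the paths, table, and connection details are placeholders:

    // start the shell with the driver, e.g.: spark-shell --packages org.postgresql:postgresql:42.7.3
    // read the downloaded CSV (or an s3a:// path once S3 credentials are configured)
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///tmp/daily.csv")

    // any per-row processing goes here, then append the result to Postgres over JDBC
    df.write
      .mode("append")
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "daily_data")
      .option("user", "postgres")
      .option("password", "secret")
      .save()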
Upvotes: 1