Ondrej

Reputation: 502

Using Apache Spark to process files from the web

I have some remote data files that need processing, usually sitting on FTP servers or behind APIs (dumps, not streams). When looking through the Spark documentation, I noticed very sparse support for these data sources, especially when it comes to authentication.

I reckon this is due to the non-distributable nature (and/or rate limits) of possibly ephemeral web links, but I wanted to have this confirmed so that I can act in accordance with Spark paradigms.

My question is thus: is the modus operandi to download all files to a Spark-supported storage system (using whatever tool we can get our hands on) and proceed with Spark only after that?

Upvotes: 0

Views: 68

Answers (1)

ganeiy

Reputation: 302

Yes, there are two common design patterns to handle this:
1. Copy the dumps from the FTP servers/APIs into HDFS and run Spark on them.
2. Copy the dumps into S3 instead of HDFS, if there is no streaming support (see the sketch below).
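A minimal sketch of the second pattern, assuming a plain anonymous FTP dump, an S3 landing bucket, and a Spark cluster with the s3a connector (hadoop-aws) configured; the host, file path and bucket name are placeholders, not anything from the question:

```python
import ftplib
import io

import boto3
from pyspark.sql import SparkSession

FTP_HOST = "ftp.example.com"        # placeholder FTP server
REMOTE_FILE = "/exports/dump.csv"   # placeholder dump file
BUCKET = "my-landing-bucket"        # placeholder S3 bucket
KEY = "raw/dump.csv"

# 1. Fetch the dump from the FTP server (a single-machine step, outside Spark).
#    This is where authentication lives; ftp.login() here is anonymous,
#    a real server would need credentials.
buf = io.BytesIO()
with ftplib.FTP(FTP_HOST) as ftp:
    ftp.login()
    ftp.retrbinary(f"RETR {REMOTE_FILE}", buf.write)
buf.seek(0)

# 2. Stage the file in S3 so every Spark executor can reach it.
boto3.client("s3").upload_fileobj(buf, BUCKET, KEY)

# 3. Only now hand the data to Spark, via the s3a connector.
spark = SparkSession.builder.appName("ftp-dump-ingest").getOrCreate()
df = spark.read.option("header", "true").csv(f"s3a://{BUCKET}/{KEY}")
df.show()
```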

In both cases, the Spark job can be kicked off based on an event; a scheduler such as cron or Airflow can take care of this.
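For the scheduling part, a minimal Airflow sketch: a daily DAG that first stages the dump and only then submits the Spark job. The `fetch_dump` helper is assumed to wrap the FTP-to-S3 copy above, and the task ids, application path and connection id are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

from my_ingest import fetch_dump  # hypothetical helper doing the FTP -> S3 copy

with DAG(
    dag_id="ftp_dump_to_spark",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # Stage the remote dump into S3/HDFS first.
    stage = PythonOperator(task_id="stage_dump", python_callable=fetch_dump)

    # Then submit the Spark job against the staged copy.
    process = SparkSubmitOperator(
        task_id="process_dump",
        application="/jobs/process_dump.py",  # placeholder Spark application
        conn_id="spark_default",
    )

    stage >> process  # download first, run Spark only on staged data
```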

Upvotes: 1
