Rabbit

Reputation: 1821

Can I use Spark for custom computation?

I have around 200 large zip files (some >1GB) that need to be unzipped and processed with Python geo- and image-processing libraries. The results will be written as new files to FileStore and later used for ML tasks in Databricks.

What would the general approach be if I want to exploit the Spark cluster's processing power? I'm thinking of adding the filenames to a DataFrame and using user-defined functions to process them via a select or similar, something like the sketch below. I believe I should be able to make this run in parallel on the cluster, where the workers get just the filename and then load the files locally.
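A rough sketch of that DataFrame + UDF idea (the paths, the column name, and the process_file body are placeholders, not working processing code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('myApp').getOrCreate()

# placeholder paths; in practice these would be the real zip locations
zipfiles = ['/dbfs/FileStore/archive1.zip', '/dbfs/FileStore/archive2.zip']
df = spark.createDataFrame([(p,) for p in zipfiles], ['path'])

@udf(returnType=StringType())
def process_file(path):
    # each worker receives only the path and loads the file locally
    # ... unzip and run the geo/image processing here ...
    return 'done: ' + path

df.select(process_file(df['path']).alias('status')).collect()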

Is this reasonable, or is there some completely different direction I should go?

Update - Or maybe like this:

from pyspark.sql import SparkSession

zipfiles = ...

def f(x):
    # runs on the workers; print output goes to the executor logs
    print("Processing " + x)

spark = SparkSession.builder.appName('myApp').getOrCreate()
rdd = spark.sparkContext.parallelize(zipfiles)
rdd.foreach(f)
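For the foreach variant, f would have to do the actual unzip and processing on the worker. A hedged sketch of what that could look like (the local directory and the processing step are placeholders):

import os
import zipfile

def f(path):
    # extract to the worker's local disk
    local_dir = os.path.join('/tmp', os.path.splitext(os.path.basename(path))[0])
    os.makedirs(local_dir, exist_ok=True)
    with zipfile.ZipFile(path) as zf:
        zf.extractall(local_dir)
    for name in os.listdir(local_dir):
        # run the geo/image processing library here and write the
        # result back to FileStore (e.g. somewhere under /dbfs/FileStore/)
        pass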

Update 2: For anyone doing this: since Spark by default reserves almost all available memory, you might have to reduce that with the setting spark.executor.memory 1g, or you may quickly run out of memory on the workers.
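One place to apply that setting is when the session is built (on Databricks it can also be set in the cluster's Spark configuration). A minimal example:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('myApp')
         .config('spark.executor.memory', '1g')  # cap executor memory per the note above
         .getOrCreate())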

Upvotes: 1

Views: 179

Answers (1)

shay__

Reputation: 3990

Yes, you can use Spark as a generic parallel-processing engine, give or take some serialization issues. For example, in one project I used Spark to scan many Bloom filters in parallel and to random-access indexed files wherever the Bloom filters returned positive. Most likely you will need the RDD API for such tailor-made solutions.
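A minimal illustration of that RDD-as-generic-work-queue pattern (the task list and the do_work function are made up for the example):

from pyspark.sql import SparkSession

def do_work(task):
    # any self-contained, picklable Python computation can go here
    return task, len(str(task))

spark = SparkSession.builder.appName('genericWork').getOrCreate()
tasks = ['job-a', 'job-b', 'job-c']  # placeholder task identifiers
results = (spark.sparkContext
           .parallelize(tasks, numSlices=len(tasks))
           .map(do_work)
           .collect())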

Upvotes: 0
