y.selivonchyk

Reputation: 9900

How to achieve vertical parallelism in Spark?

Is it possible to run multiple calculations in parallel using spark?

Example cases that could benefit from that:

running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from having calculation only on a single worker and having as many workers working on single columns as possible.

The closest equivalents I can think of would be SparkR.lapply() or .Net's Parallel.ForEach(), but for a cluster environment rather than a simpler multi-threading case.

Upvotes: 1

Views: 368

Answers (1)

Jacek Laskowski

Reputation: 74619

I'd say that Spark is good at scheduling distributed computing tasks and could handle your cases with ease, but you'd have to develop the solutions yourself. I'm not saying it would take ages, but it would require quite a lot of effort, since it sits below the developer-facing APIs of Spark SQL, Spark MLlib, Structured Streaming and the like.

You'd have to use Spark Core API and create a custom RDD that would know how to describe such computations.
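To make the idea of "a custom RDD that describes such computations" more concrete, here is a toy, Spark-free model of the contract an RDD fulfils: a set of partitions plus a compute function per partition. In the column-wise case each partition would carry one column, so the scheduler could place each column's work on a different worker. All names here are hypothetical illustrations, not Spark's actual API:

```python
class ColumnWiseRDD:
    """Toy model of the RDD contract: enumerate partitions, compute each one.

    Not real Spark code -- a real custom RDD would extend
    org.apache.spark.rdd.RDD in Scala and override getPartitions/compute.
    """

    def __init__(self, columns, func):
        self.columns = columns  # dict: column name -> list of values
        self.func = func        # the per-column task to run

    def get_partitions(self):
        # One partition per column, so the scheduler is free to run
        # each column's task on a separate worker.
        return list(self.columns)

    def compute(self, partition):
        # Run the task against a single column's data.
        return self.func(self.columns[partition])


rdd = ColumnWiseRDD({"a": [1, 2, 3], "b": [10, 20]}, sum)
results = {p: rdd.compute(p) for p in rdd.get_partitions()}
print(results)  # {'a': 6, 'b': 30}
```

The point of the sketch is the partitioning choice: by making "one column" the unit of work, the cluster scheduler (rather than your driver code) decides how many columns run concurrently.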

Let's discuss the first idea.

running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from having calculation only on a single worker and having as many workers working on single columns as possible.

"column-wise tasks for large columns" seems to suggest that you're thinking about Spark SQL's DataFrames and Spark MLlib's StringIndexer transformer. These are higher-level APIs that don't offer such a feature, and you're not supposed to tackle the problem through them. It's an optimization concern, so you have to go deeper into Spark.
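As a rough illustration of the scheduling pattern (not of Spark itself), independent per-column jobs can be dispatched concurrently and collected when done. Here a small pure-Python function stands in for StringIndexer (it maps each distinct value to an index by descending frequency, ties broken alphabetically, roughly mimicking MLlib's default ordering), and a thread pool stands in for the cluster:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def string_index(column):
    """Stand-in for StringIndexer.fit/transform on one column:
    most frequent value gets index 0, ties broken alphabetically."""
    freq = Counter(column)
    ordering = sorted(freq, key=lambda v: (-freq[v], v))
    mapping = {value: i for i, value in enumerate(ordering)}
    return [mapping[v] for v in column]


# With 10K columns the shape is the same: one independent task per column.
columns = {
    "color": ["red", "blue", "red", "green"],
    "size":  ["s", "m", "s", "s"],
}

# Each column is indexed as its own task; a cluster scheduler would
# play the role this thread pool plays here.
with ThreadPoolExecutor(max_workers=4) as pool:
    indexed = dict(zip(columns, pool.map(string_index, columns.values())))

print(indexed["color"])  # [0, 1, 0, 2]
```

In a real Spark application the per-column function would trigger a Spark job, and submitting such jobs from multiple driver threads is what lets the scheduler interleave them across the cluster.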

I think you'd have to rewrite the higher-level APIs in Spark SQL and Spark MLlib with your own optimized custom code that implements the feature.

The same goes for the other requirement, except that this time you'd only need to be concerned with Spark SQL (leaving Spark MLlib aside).

Wrapping up, I think both are possible with some development effort (i.e. neither is available out of the box today).

Upvotes: 1
