Reputation: 373
This seems like a really naive question but I can't find a straight answer anywhere.
I'm using Spark RDDs to convert a very large TSV file into two sets of key-value pairs to be loaded into a distributed key-value store. I'm not using DataFrames because the TSV doesn't follow a very well-defined schema and a sparse matrix is a better model for it.
One set of key-value pairs represents the original data in an Entity-Attribute-Value model, and the other set transposes the keys and values from the first set into what I suppose you could call an Attribute-Value-Entity model - I just made that term up.
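To make that made-up term concrete, a single TSV row would end up in the two pair sets roughly like this (the field names and key encoding here are only illustrative, not my real format):
val row = "item42\tcolour\tred"                 // entity <TAB> attribute <TAB> value
val Array(entity, attr, value) = row.split('\t')
val eavPair = (s"$entity|$attr", value)         // Entity-Attribute -> Value
val avePair = (s"$attr|$value", entity)         // Attribute-Value -> Entity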
My pseudocode is roughly,
// Cache the raw lines so they are only read from HDFS once
val orig: RDD[String] = sc.textFile("hdfs:///some-file.tsv").cache
// Entity-Attribute-Value pairs
val entityAttrPairs = orig.mapPartitions(convertLinesToKVPairs)
// Attribute-Value-Entity pairs (the transposed set)
val attrEntityPairs = orig.mapPartitions(convertLinesToIndexKVPairs)
entityAttrPairs.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/entity-attr")
attrEntityPairs.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/attr-entity")
My question is: will the separate calls to mapPartitions cause Spark to iterate over the entire RDD twice? Would I be better off producing both the entity-attr and attr-entity pairs in a single pass through the RDD, even though it would make the code much less readable?
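For reference, the kind of single-pass version I have in mind would look roughly like this, where parseLine is a stand-in for my real per-line parsing and returns both pair kinds at once:
def parseLine(line: String) = {                 // hypothetical helper, parsing is illustrative only
  val Array(e, a, v) = line.split('\t')
  ((s"$e|$a", v), (s"$a|$v", e))                // (EAV pair, transposed AVE pair)
}

// One pass over the raw lines: emit both pair kinds with a tag, then split them
val tagged = orig.mapPartitions { lines =>
  lines.flatMap { line =>
    val (eav, ave) = parseLine(line)
    Iterator(("eav", eav), ("ave", ave))
  }
}.cache()                                       // cache the parsed pairs instead of the raw text

val entityAttrPairs = tagged.filter(_._1 == "eav").values
val attrEntityPairs = tagged.filter(_._1 == "ave").values
The two saves at the end would stay the same; they would still run as two jobs, but the per-line parsing would only happen once.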
Upvotes: 0
Views: 345
Reputation: 67075
Yes and no. Since the base RDD is cached, the first job will load it and put it into memory. The second job still requires a new pass, as it is a separate branch off of the original RDD; however, this time the original RDD will be read from the cache rather than from HDFS.
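A rough way to see this for yourself (a sketch, using count() as a stand-in for your saves) is to look at orig.toDebugString between the two actions; once the first action has run, the lineage should report the partitions as cached, and the second job reads them from memory instead of re-scanning HDFS:
entityAttrPairs.count()        // first action: scans the TSV from HDFS and fills orig's cache
println(orig.toDebugString)    // the lineage should now show CachedPartitions for orig
attrEntityPairs.count()        // second action: a separate job, but orig comes from the cache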
Upvotes: 3