Russ Weeks

Reputation: 373

Can Spark optimize multiple passes through an RDD?

This seems like a really naive question, but I can't find a straight answer anywhere.

I'm using Spark RDDs to convert a very large TSV file into two sets of key-value pairs to be loaded into a distributed key-value store. I'm not using DataFrames because the TSV doesn't follow a very well-defined schema and a sparse matrix is a better model for it.

One set of key-value pairs represents the original data in an Entity-Attribute-Value model, and the other set transposes the keys and values from the first set into what I'll call an Attribute-Value-Entity model (a term I just made up).
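To make that concrete (with made-up data), a single row like "user1 TAB name TAB Alice" would become one pair in each set:

// Entity-Attribute-Value pair for the row user1 \t name \t Alice
val eav = (("user1", "name"), "Alice")
// transposed Attribute-Value-Entity pair for the same row
val ave = (("name", "Alice"), "user1")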

My pseudocode is roughly:

val orig: RDD[String] = sc.textFile("hdfs:///some-file.tsv").cache()
// two independent transformations branching off the same cached RDD
val entityAttrPairs = orig.mapPartitions(convertLinesToKVPairs)      // entity-attribute-value pairs
val attrEntityPairs = orig.mapPartitions(convertLinesToIndexKVPairs) // transposed attribute-value-entity pairs
// two separate actions, one per output path
entityAttrPairs.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/entity-attr")
attrEntityPairs.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/attr-entity")

My question is: will the separate calls to mapPartitions cause Spark to iterate over the entire RDD twice? Would I be better off trying to produce both the entity-attr and attr-entity pairs in a single pass through the RDD, even though it will make the code much less readable?
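For reference, the single-pass version I have in mind would look roughly like this (just a sketch: the tab-separated entity/attribute/value layout and the tagging scheme are assumptions, and the save calls keep the same shorthand as the pseudocode above):

val tagged = orig.mapPartitions { lines =>
  lines.flatMap { line =>
    val cols = line.split('\t')  // assumed layout: entity, attribute, value
    Iterator(
      ("entity-attr", ((cols(0), cols(1)), cols(2))),  // EAV pair
      ("attr-entity", ((cols(1), cols(2)), cols(0)))   // transposed AVE pair
    )
  }
}.cache()
tagged.filter(_._1 == "entity-attr").values.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/entity-attr")
tagged.filter(_._1 == "attr-entity").values.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/attr-entity")

Note this reads the TSV only once, but the two saves still traverse tagged twice (served from its cache), so the saving would only be the line-parsing cost.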

Upvotes: 0

Views: 345

Answers (1)

Justin Pihony

Reputation: 67075

Yes and no. Since the base RDD is cached, the job triggered by the first save will read the file and put it into memory. The second save will still require a new iteration, because it is a separate branch off the original RDD; however, this time the original RDD will be read from the cache rather than re-read from HDFS.
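A minimal sketch of that behaviour (using map in place of your mapPartitions helpers and count() as a stand-in action, since neither changes the caching):

val orig = sc.textFile("hdfs:///some-file.tsv").cache()  // lazy: nothing is read yet
val first = orig.map(_.length)    // stand-in for convertLinesToKVPairs
val second = orig.map(_.reverse)  // stand-in for convertLinesToIndexKVPairs
first.count()   // job 1: reads the file from HDFS and populates the cache
second.count()  // job 2: iterates orig again, but reads it from the cache

After job 1, orig.toDebugString (or the Storage tab of the Spark UI) will show the cached partitions that job 2 reuses.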

Upvotes: 3
