Reputation: 373
This seems like a really naive question but I can't find a straight answer anywhere.
I'm using Spark RDDs to convert a very large TSV file into two sets of key-value pairs to be loaded into a distributed key-value store. I'm not using DataFrames because the TSV doesn't follow a very well-defined schema and a sparse matrix is a better model for it.
One set of key-value pairs represents the original data in an Entity-Attribute-Value model, and the other set transposes the keys and values from the first set into what I suppose you could call an Attribute-Value-Entity model - I just made that term up.
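To make that made-up term concrete, a single TSV row would end up in the two pair sets roughly like this (the field names and key encoding here are only illustrative, not my real format):
val row = "item42\tcolour\tred"                 // entity <TAB> attribute <TAB> value
val Array(entity, attr, value) = row.split('\t')
val eavPair = (s"$entity|$attr", value)         // Entity-Attribute -> Value
val avePair = (s"$attr|$value", entity)         // Attribute-Value -> Entity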
My pseudocode is roughly,
// Cache the raw lines so they are only read from HDFS once
val orig: RDD[String] = sc.textFile("hdfs:///some-file.tsv").cache
// Entity-Attribute-Value pairs
val entityAttrPairs = orig.mapPartitions(convertLinesToKVPairs)
// Attribute-Value-Entity pairs (the transposed set)
val attrEntityPairs = orig.mapPartitions(convertLinesToIndexKVPairs)
entityAttrPairs.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/entity-attr")
attrEntityPairs.saveAsNewAPIHadoopFile("hdfs:///ready-for-ingest/attr-entity")
My question is: will the separate calls to mapPartitions cause Spark to iterate over the entire RDD twice? Would I be better off producing both the entity-attr and attr-entity pairs in a single pass through the RDD, even though it would make the code much less readable?
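For reference, the kind of single-pass version I have in mind would look roughly like this, where parseLine is a stand-in for my real per-line parsing and returns both pair kinds at once:
def parseLine(line: String) = {                 // hypothetical helper, parsing is illustrative only
  val Array(e, a, v) = line.split('\t')
  ((s"$e|$a", v), (s"$a|$v", e))                // (EAV pair, transposed AVE pair)
}

// One pass over the raw lines: emit both pair kinds with a tag, then split them
val tagged = orig.mapPartitions { lines =>
  lines.flatMap { line =>
    val (eav, ave) = parseLine(line)
    Iterator(("eav", eav), ("ave", ave))
  }
}.cache()                                       // cache the parsed pairs instead of the raw text

val entityAttrPairs = tagged.filter(_._1 == "eav").values
val attrEntityPairs = tagged.filter(_._1 == "ave").values
The two saves at the end would stay the same; they would still run as two jobs, but the per-line parsing would only happen once.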
Upvotes: 0
Views: 345
Reputation: 67075
Yes and no. Since the base RDD is cached, the first job will load it and put it into memory. The second job still requires a new pass, as it is a separate branch off of the original RDD; however, this time the original RDD will be read from the cache rather than from HDFS.
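A rough way to see this for yourself (a sketch, using count() as a stand-in for your saves) is to look at orig.toDebugString between the two actions; once the first action has run, the lineage should report the partitions as cached, and the second job reads them from memory instead of re-scanning HDFS:
entityAttrPairs.count()        // first action: scans the TSV from HDFS and fills orig's cache
println(orig.toDebugString)    // the lineage should now show CachedPartitions for orig
attrEntityPairs.count()        // second action: a separate job, but orig comes from the cache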
Upvotes: 3