Reputation: 4920

How to parallelize list iteration and be able to create RDDs in Spark?

I've just started learning Spark and Scala.

From what I understand it's bad practice to use collect, because it gathers the whole data in memory and it's also bad practice to use for, because the code inside the block is not executed concurrently by more than one node.

Now, I have a List of numbers from 1 to 10:

List(1,2,3,4,5,6,7,8,9,10)

and for each of these values I need to generate a RDD using this value.

in such cases, how can I generate the RDD?

By doing

sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).map(number => generate_rdd(number))

I get an error because RDD cannot be generated inside another RDD.

What is the best workaround to this problem?

Upvotes: 0

Answers (2)

balaudt

Reputation: 91

Assuming that the number of RDDs that you would like to create would be lower and hence that parallelization itself need not be accomplished by RDD, we can use Scala's parallel collections instead. For example, I tried to count the number of lines in about 40 HDFS files simultaneously using the following piece of code [Ignore the setting of delimiter. For newline delimited texts, this could have well been replaced by sc.textFile]:

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "~^~")
val parSeq = List("path of file1.xsv","path of file2.xsv",...).par
parSeq.map(x => {
  val rdd = sc.newAPIHadoopFile(x, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  println(rdd.count())
})

Here is part of the output in Spark UI. As seen, most of the RDD count operations started at the same time.

Upvotes: 1

Mustafa Simav

Reputation: 1019

Assuming generate_rdd defined like def generate_rdd(n: Int): RDD[Something] what you need is flatMap instead of map.

sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).flatMap(number => generate_rdd(number))

This will give a RDD that is a concatenation of all RDDs that are created for numbers from 1 to 10.

Upvotes: 2

How to parallelize list iteration and be able to create RDDs in Spark?

Answers (2)

Related Questions