BBischof

Reputation: 310

Iterating over a pair RDD to run a function on an RDD from the second value.

I'm having an issue with a function (which I can't modify) that needs an RDD as input, but my data is in a format from which I can't seem to produce a plain RDD to pass to the function.

Consider an RDD, coolRdd, created by a groupByKey so that it consists of (name, data) pairs, where name is a String and data is an Iterable[String]. However, I need to run CoolFunction on it, which takes arguments of type (RDD[String], String).
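For concreteness, here is a minimal sketch of how such an RDD might be built (the keys and values are placeholders I've made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("cool").setMaster("local[*]"))

// Plain (key, value) pairs; the real data comes from elsewhere.
val raw: RDD[(String, String)] = sc.parallelize(Seq(
  ("alice", "a1"), ("alice", "a2"),
  ("bob", "b1")
))

// groupByKey yields (String, Iterable[String]) pairs, as described above.
val coolRdd: RDD[(String, Iterable[String])] = raw.groupByKey()

Here was my attempt: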

coolRdd.foreach{ case (name, data) => sc.CoolFunction(data.toList, name) }

which fails to compile with

found   : List[String]
required: org.apache.spark.rdd.RDD[String]

I also tried running sc.parallelize on data.toList, but that throws a NullPointerException: the closure passed to foreach runs on the executors, where the SparkContext doesn't exist, and Spark doesn't allow creating an RDD from within another RDD's operation (no nested RDDs).
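Reconstructed from that description, the failing version looked roughly like this:

// This closure runs on the executors, where sc is null,
// so sc.parallelize throws a NullPointerException at runtime.
coolRdd.foreach { case (name, data) =>
  sc.CoolFunction(sc.parallelize(data.toList), name)
}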

I'm wondering if it's possible to write another function that converts data and then calls CoolFunction. It would be better if I didn't have to do this on the driver, but if necessary that's doable.

As a bonus: I'm actually doing this with streaming, so this whole thing will live inside a call to foreachRDD, but I expect that if I can get it working in the batch case, I can make it work in the streaming case.

Upvotes: 0

Views: 1773

Answers (1)

BBischof

Reputation: 310

I was able to find a solution:

coolRdd
  .collect
  .foreach { case (name, data) =>
    // collect brings the pairs back to the driver, where sc is usable
    val dataList = data.toList
    sc.CoolFunction(sc.parallelize(dataList), name)
  }

Where I went wrong was failing to collect first. Only the driver can create RDDs (the SparkContext isn't available on the executors), so the pairs have to be collected back to the driver before sc.parallelize can be called on each one.
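For the streaming bonus in the question, the same pattern should carry over, because the function passed to foreachRDD runs on the driver. A minimal sketch, assuming coolStream is a DStream[(String, Iterable[String])]:

// foreachRDD executes its body on the driver, so collect and
// sc.parallelize are both legal here.
coolStream.foreachRDD { rdd =>
  rdd.collect.foreach { case (name, data) =>
    sc.CoolFunction(sc.parallelize(data.toList), name)
  }
}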

Upvotes: 0
