Iva

Reputation: 367

Spark-GraphX: create an RDD from an ArrayBuffer of Strings

I have an ArrayBuffer of Strings that contains the labels of all the vertices of the graph I want to create. I need to create an RDD[(VertexId, String)] whose elements will be the nodes of my future graph, where the VertexId of each node is the index of the node's label in the ArrayBuffer. I have only found information about creating an RDD with SparkContext.textFile(String fname), but nothing on how to create an RDD from in-memory data structures.

Is there a way to do this or do I always have to create the RDD from a file?

Upvotes: 1

Views: 1581

Answers (1)

eliasah

Reputation: 40370

What you are looking for is the parallelize method:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

So, given your ArrayBuffer[(VertexId, String)], you can convert it to a Seq and pass the result as the argument to sc.parallelize.

According to the ArrayBuffer scaladoc, you can call the toSeq method directly on your ArrayBuffer.

val distData = sc.parallelize(data.toSeq) // data is your ArrayBuffer

If your ArrayBuffer is of type ArrayBuffer[(VertexId, String)], distData will be an RDD[(VertexId, String)].
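
Since the question starts from an ArrayBuffer[String] rather than from ready-made pairs, one way to get the index-based ids is to call zipWithIndex on the buffer before parallelizing it. This is only a minimal sketch, assuming a local SparkContext; the object name VerticesFromBuffer, the helper indexLabels, and the sample labels are made up for illustration and are not part of Spark or GraphX:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

object VerticesFromBuffer {
  // Hypothetical helper: pair each label with its position in the buffer.
  // VertexId is a type alias for Long, so the Int index is widened with toLong.
  def indexLabels(labels: ArrayBuffer[String]): Seq[(VertexId, String)] =
    labels.zipWithIndex.map { case (label, idx) => (idx.toLong, label) }.toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to keep the sketch runnable on one machine.
    val conf = new SparkConf().setAppName("vertices-from-buffer").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val labels = ArrayBuffer("a", "b", "c") // sample data, not from the question
      // Copy the driver-side pairs into a distributed dataset.
      val vertices: RDD[(VertexId, String)] = sc.parallelize(indexLabels(labels))
      vertices.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The indexing is done on the driver here because the buffer already lives there; the resulting RDD[(VertexId, String)] has the shape GraphX expects for a vertex RDD.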

Upvotes: 1
