Reputation: 367
I have an ArrayBuffer of Strings containing the labels of all the vertices of the graph I want to create. I need to create an RDD of [(VertexId, String)]
whose elements will be the nodes of my future graph, where each node's VertexId is the index of its label in the ArrayBuffer.
I have only found information about creating an RDD using SparkContext.textFile(String fname)
, but nothing on how to create an RDD from in-memory data structures.
Is there a way to do this or do I always have to create the RDD from a file?
Upvotes: 1
Views: 1581
Reputation: 40370
What you are looking for is the parallelize method:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
So, given your ArrayBuffer[(VertexId, String)], you'll need to transform it into a Seq first and then pass it as an argument to sc.parallelize.
According to the ArrayBuffer scaladoc, you can call the toSeq method on your ArrayBuffer directly.
val distData = sc.parallelize(data.toSeq) // data is your ArrayBuffer
If your ArrayBuffer is, as described in the question, of type ArrayBuffer[(VertexId, String)]
, then distData
will be an RDD[(VertexId, String)]
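Since the question actually starts from an ArrayBuffer of plain Strings, one way to build the (index, label) pairs is zipWithIndex. Here's a minimal sketch (the labels are hypothetical sample data; GraphX's VertexId is a type alias for Long, so the index is converted accordingly):

```scala
import scala.collection.mutable.ArrayBuffer

object VertexPairs {
  def main(args: Array[String]): Unit = {
    // Labels of the graph vertices (hypothetical sample data)
    val labels = ArrayBuffer("alice", "bob", "carol")

    // Pair each label with its position in the buffer;
    // the index becomes the VertexId (a Long)
    val vertices: Seq[(Long, String)] =
      labels.zipWithIndex.map { case (label, i) => (i.toLong, label) }.toSeq

    vertices.foreach(println)
    // With a SparkContext in scope, this Seq parallelizes directly:
    //   val vertexRDD: RDD[(VertexId, String)] = sc.parallelize(vertices)
  }
}
```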
Upvotes: 1