Reputation: 1859
val test = sc.textFile(logFile, 12).cache()
In the above code snippet, I am trying to get Apache Spark to parallelize reading a huge text file. How do I store the contents of this file in a string?
I was earlier doing this to read the file:
val lines = scala.io.Source.fromFile(logFile, "utf-8").getLines.mkString
but now I am trying to make the read faster using the Spark context.
Upvotes: 1
Views: 2540
Reputation: 17431
Reading the file into a String through Spark is very unlikely to be faster than reading it directly. To work efficiently in Spark you should keep everything in RDD form and do your processing that way, only reducing down to a (small) value at the end. Reading it in Spark just means you'll read it into memory locally, serialize the chunks and send them out to your cluster nodes, then serialize them again to send them back to your local machine and gather them together. Spark is a powerful tool, but it's not magic; it can only parallelize operations that are actually parallel. (Do you even know that reading the file into memory is the bottleneck? Always benchmark before optimizing.)
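For example, here is a minimal sketch of that RDD-first style, assuming sc is your SparkContext and logFile is the path from the question (the "ERROR" filter is just an illustrative placeholder):
val logLines = sc.textFile(logFile, 12).cache()
// Stays distributed: filter and count run on the cluster,
// and only a single Long travels back to the driver.
val errorCount = logLines.filter(_.contains("ERROR")).count()
// Likewise, reduce to one small value: the length of the longest line.
val maxLen = logLines.map(_.length).reduce(math.max)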
But to answer your question, you could use (test being your RDD from the question):
test.toLocalIterator.mkString
Just don't expect it to be any faster than reading the file locally.
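One detail worth noting: textFile gives you one RDD element per line with the line terminators stripped, so if you want the newlines back in the resulting string, pass a separator:
test.toLocalIterator.mkString("\n")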
Upvotes: 3
Reputation: 6764
Collect the values and then iterate over them:
var string = ""
// note: lines are concatenated with no separator between them
test.collect().foreach(i => string += i)
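Equivalently, without the mutable accumulator, mkString can do the concatenation in one step (pass "\n" as the separator if you want to preserve line breaks):
val string = test.collect().mkString("\n")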
Upvotes: 0