Siva

Reputation: 1859

apache spark textfile to a string

val test = sc.textFile(logFile, 12).cache()

In the above code snippet, I am trying to get Apache Spark to parallelize reading a huge text file. How do I store its contents in a string?

Earlier I was reading the file like this:

val lines = scala.io.Source.fromFile(logFile, "utf-8").getLines.mkString

but now I am trying to make the read faster using the Spark context.
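For reference, the local-read pattern above can be sketched as a self-contained snippet (the temporary file and the `readAll` helper here are illustrative stand-ins, not part of the original code):

```scala
import java.nio.file.{Files, Path}
import java.nio.charset.StandardCharsets

object LocalRead {
  // Read a whole UTF-8 file into one string, as in the question.
  // Note: getLines() drops the line terminators, so mkString with no
  // separator concatenates the lines directly together.
  def readAll(path: Path): String = {
    val source = scala.io.Source.fromFile(path.toFile, "utf-8")
    try source.getLines().mkString finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("log", ".txt")
    Files.write(tmp, "line one\nline two\n".getBytes(StandardCharsets.UTF_8))
    println(readAll(tmp)) // prints "line oneline two"
  }
}
```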

Upvotes: 1

Views: 2540

Answers (2)

lmm

Reputation: 17431

Reading the file into a String through Spark is very unlikely to be faster than reading it directly. To work efficiently in Spark you should keep everything in RDD form and do your processing that way, only reducing down to a (small) value at the end. Reading it through Spark just means you'll read it into memory locally, serialize the chunks and send them out to your cluster nodes, then serialize them again to send them back to your local machine and gather them together. Spark is a powerful tool, but it's not magical: it can only parallelize operations that are actually parallel. (Do you even know that reading the file into memory is the bottleneck? Always benchmark before optimizing.)

But to answer your question, you could use

lines.toLocalIterator.mkString

Just don't expect it to be any faster than reading the file locally.
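For intuition, `toLocalIterator` streams the RDD's elements back to the driver, and `mkString` concatenates them exactly as it does on any Scala iterator. A plain iterator stands in for the RDD here; note that the original line terminators are gone, so pass a separator if you need them back:

```scala
object JoinLines {
  def main(args: Array[String]): Unit = {
    // Stand-in for lines.toLocalIterator, which yields one line at a time.
    // Defined as a def because an Iterator is single-use.
    def lines: Iterator[String] = Iterator("ERROR a", "INFO b")

    println(lines.mkString)       // prints "ERROR aINFO b" (no separator)
    println(lines.mkString("\n")) // passing "\n" restores the line breaks
  }
}
```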

Upvotes: 3

griffon vulture

Reputation: 6764

Collect the values, then iterate over them:

var string = ""
test.collect().foreach(i => string += i)
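For what it's worth, `mkString` on the collected array yields the same string without repeated copying; a plain array stands in for `test.collect()` in this sketch:

```scala
object CollectConcat {
  def main(args: Array[String]): Unit = {
    // Stand-in for test.collect(), which returns the RDD as an Array[String].
    val collected = Array("first", "second", "third")

    // The loop from the answer, appending each element in turn:
    var string = ""
    collected.foreach(i => string += i)

    // Equivalent result via mkString, which builds the string in one pass
    // instead of copying it on every append:
    val joined = collected.mkString

    println(string == joined) // prints "true"
  }
}
```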

Upvotes: 0

Related Questions