Epsilon

Reputation: 73

Spark Code Optimization

My task is to write code that reads a big file (one that doesn't fit into memory), reverses it, and outputs the five most frequent words.
I have written the code below and it does the job.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object ReverseFile {
  def main(args: Array[String]) {


    val conf = new SparkConf().setAppName("Reverse File")
    conf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    val txtFile = "path/README_mid.md"
    val txtData = sc.textFile(txtFile)
    txtData.cache()

    val tmp = txtData.map(l => l.reverse).zipWithIndex().map{ case(x,y) => (y,x)}.sortByKey(ascending = false).map{ case(u,v) => v}

    tmp.coalesce(1,true).saveAsTextFile("path/out.md")

    val txtOut = "path/out.md"
    val txtOutData = sc.textFile(txtOut)
    txtOutData.cache()

    val wcData = txtOutData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(ascending = false)
    wcData.collect().take(5).foreach(println)


  }
}

The problem is that I'm new to Spark and Scala, and as you can see in the code, I first read the file, reverse it, and save it, then read the reversed file back in and output the five most frequent words.

Upvotes: 1

Views: 438

Answers (2)

zero323

Reputation: 330423

The most expensive part of your code is sorting, so the obvious improvement is to remove it. It is relatively simple in the second case, where a full sort is completely unnecessary:

val wcData = txtData
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // No need to swap or sort

// Use the top method with an explicit ordering in place of swap / sortByKey
val top5 = wcData.top(5)(scala.math.Ordering.by[(String, Int), Int](_._2))
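Since top already returns a local array on the driver, the result can be printed directly without a separate collect (a one-line usage example for the top5 value defined above):

top5.foreach(println) // Array[(String, Int)] with the five most frequent words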

Reversing the order of lines is a little bit trickier. First let's reorder the elements within each partition:

val reversedPartitions = txtData.mapPartitions(_.toList.reverse.toIterator)

Now you have two options:

  • use custom partitioner

    // Requires: import org.apache.spark.Partitioner
    class ReversePartitioner(n: Int) extends Partitioner {
      def numPartitions: Int = n
      def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Int]
        numPartitions - 1 - k
      }
    }
    
    val partitioner = new ReversePartitioner(reversedPartitions.partitions.size)
    
    val reversed = reversedPartitions
      // Add current partition number
      .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.toList)))
      // Repartition to get reversed order
      .partitionBy(partitioner)
      // Drop partition numbers
      .values
      // Reshape
      .flatMap(identity)
    

    It still requires a shuffle, but it is relatively portable, and the data remains accessible in memory.

  • if all you want is to save the reversed data, you can call saveAsTextFile on reversedPartitions and reorder the output files logically. Since the part-n name format identifies the source partition, all you have to do is rename each part-n to part-(number-of-partitions - 1 - n). This requires saving the data, so it is not exactly optimal, but if you use, for example, an in-memory file system, it can be a pretty good solution (a minimal renaming sketch follows below).
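A minimal sketch of that renaming step, assuming the reversed partitions are saved to a hypothetical directory path/reversed_parts and using the Hadoop FileSystem API that Spark already bundles; the reversed-part prefix is only there to avoid clobbering part files that have not been renamed yet:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical output directory for the per-partition reversed data.
val outputDir = "path/reversed_parts"
reversedPartitions.saveAsTextFile(outputDir)

val fs = FileSystem.get(sc.hadoopConfiguration)
val n = reversedPartitions.partitions.size

// Rename part-i to reversed-part-(n - 1 - i) so the files read back in reversed order.
(0 until n).foreach { i =>
  val src = new Path(f"$outputDir/part-$i%05d")
  val dst = new Path(f"$outputDir/reversed-part-${n - 1 - i}%05d")
  if (fs.exists(src)) fs.rename(src, dst)
}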

Upvotes: 2

Reactormonk

Reputation: 21730

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object ReverseFile {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Reverse File")
    conf.set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)
    val txtFile = "path/README_mid.md"
    val txtData = sc.textFile(txtFile)
    txtData.cache()

    val reversed = txtData
      .zipWithIndex()
      .map(_.swap)
      .sortByKey(ascending = false)
      .map(_._2) // No need to deconstruct the tuple.

    // No need for the coalesce, spark should do that by itself.
    reversed.saveAsTextFile("path/reversed.md")

    // Reuse txtData here.
    val wcData = txtData
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map(_.swap)
      .sortByKey(ascending = false)

    wcData
      .take(5) // Take already collects.
      .foreach(println)
  }
}

Always do the collect() last, so Spark can evaluate things on the cluster.
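As a small illustration (not part of the original answer): collect() first ships the entire word-count RDD to the driver before trimming, while take(5) lets the cluster do the work and returns only five elements:

// Brings every (count, word) pair to the driver, then keeps 5 locally.
val topFiveWasteful = wcData.collect().take(5)

// Returns only the first five elements of the sorted RDD to the driver.
val topFive = wcData.take(5)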

Upvotes: 2
