barbara

Reputation: 3201

Spark hangs during RDD reading

I have an Apache Spark master node. When I try to iterate over an RDD, Spark hangs.

Here is an example of my code:

val conf = new SparkConf()
      .setAppName("Demo")
      .setMaster("spark://localhost:7077")
      .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")    
println("Start")   

records.collect().foreach(println)    

println("Finish")

Spark log says:

Start
16/04/05 17:32:23 INFO FileInputFormat: Total input paths to process : 1
16/04/05 17:32:23 INFO SparkContext: Starting job: collect at Application.scala:23
16/04/05 17:32:23 INFO DAGScheduler: Got job 0 (collect at Application.scala:23) with 2 output partitions
16/04/05 17:32:23 INFO DAGScheduler: Final stage: ResultStage 0 (collect at Application.scala:23)
16/04/05 17:32:23 INFO DAGScheduler: Parents of final stage: List()
16/04/05 17:32:23 INFO DAGScheduler: Missing parents: List()
16/04/05 17:32:23 INFO DAGScheduler: Submitting ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19), which has no missing parents
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1811.0 B, free 122.3 KB)
16/04/05 17:32:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.18.199.187:55983 (size: 1811.0 B, free: 2.4 GB)
16/04/05 17:32:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/05 17:32:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19)
16/04/05 17:32:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks

I see only the "Start" message. It seems Spark does nothing to read the RDD. Any ideas how to fix this?

UPD

The data I want to read:

123v4n312bv4nb12,Action,Comedy
2n4vhj2gvrh24gvr,Action,Drama
sjfu326gjrw6g374,Drama,Horror

Upvotes: 2

Views: 4420

Answers (2)

AliSafari186

Reputation: 113

Use this instead:

val bufferedSource = io.Source.fromFile("/path/filename.csv")

for (line <- bufferedSource.getLines) {
    println(line)
}

bufferedSource.close()

Upvotes: 0

marios

Reputation: 8996

If Spark hangs on such a small dataset, I would first check:

  • Am I trying to connect to a cluster that doesn't respond or doesn't exist? If I am trying to connect to a running cluster, I would first try to run the same code locally with setMaster("local[*]") (see the sketch after this list). If this works, I would know that there is something going on with the "master" I am trying to connect to.

  • Am I asking for more resources than what the cluster has to offer? For example, if the cluster manages 2 GB and I ask for a 3 GB executor, my application will never get scheduled and will sit in the job queue forever.
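
A minimal sketch of that local-mode check, assuming the same file path as in the question (the object name LocalCheck is just for illustration). If this prints the three CSV lines, the code and the file are fine, and the problem is the standalone master or the resources requested from it:

import org.apache.spark.{SparkConf, SparkContext}

object LocalCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Demo")
      .setMaster("local[*]")  // run inside the driver JVM, no external master needed

    val sc = new SparkContext(conf)

    // Same read as in the question; collect() is fine for a 3-line file
    val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")
    records.collect().foreach(println)

    sc.stop()
  }
}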


Specific to the comments above: if you started your cluster with sbin/start-master.sh, you will NOT get a running cluster. At the very minimum you need a master and a worker (for standalone mode), so you should use the start-all.sh script. I recommend doing a bit more homework and following a tutorial.

Upvotes: 2
