Reputation: 3201
I have an Apache Spark
master node. When I try to iterate through an RDD, Spark hangs.
Here is an example of my code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Demo")
  .setMaster("spark://localhost:7077")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")

println("Start")
records.collect().foreach(println)
println("Finish")
Spark log says:
Start
16/04/05 17:32:23 INFO FileInputFormat: Total input paths to process : 1
16/04/05 17:32:23 INFO SparkContext: Starting job: collect at Application.scala:23
16/04/05 17:32:23 INFO DAGScheduler: Got job 0 (collect at Application.scala:23) with 2 output partitions
16/04/05 17:32:23 INFO DAGScheduler: Final stage: ResultStage 0 (collect at Application.scala:23)
16/04/05 17:32:23 INFO DAGScheduler: Parents of final stage: List()
16/04/05 17:32:23 INFO DAGScheduler: Missing parents: List()
16/04/05 17:32:23 INFO DAGScheduler: Submitting ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19), which has no missing parents
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1811.0 B, free 122.3 KB)
16/04/05 17:32:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.18.199.187:55983 (size: 1811.0 B, free: 2.4 GB)
16/04/05 17:32:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/05 17:32:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19)
16/04/05 17:32:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
I see only the "Start" message. It seems Spark does nothing to read the RDD. Any ideas how to fix it?
UPD
The data I want to read:
123v4n312bv4nb12,Action,Comedy
2n4vhj2gvrh24gvr,Action,Drama
sjfu326gjrw6g374,Drama,Horror
Upvotes: 2
Views: 4420
Reputation: 113
Use this instead:
// Read the file directly with Scala's io.Source instead of going through Spark.
val bufferedSource = io.Source.fromFile("/path/filename.csv")
for (line <- bufferedSource.getLines) {
  println(line)
}
bufferedSource.close()
Upvotes: 0
Reputation: 8996
If Spark hangs on such a small dataset, I would first check:

Am I trying to connect to a cluster that doesn't respond or doesn't exist? If I am trying to connect to a running cluster, I would first try to run the same code locally with setMaster("local[*]") (see the sketch after these checks). If that works, I know there is something wrong with the "master" I am trying to connect to.

Am I asking for more resources than the cluster has to offer? For example, if the cluster manages 2 GB and I ask for a 3 GB executor, my application will never get scheduled and will sit in the job queue forever.
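A minimal local-mode sketch for that first check, assuming nothing beyond the question's file path (the app name and variable names are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Same read as in the question, but in local mode: no standalone master or workers needed,
// so if this prints the CSV lines the code is fine and the cluster setup is the problem.
val localConf = new SparkConf()
  .setAppName("DemoLocal")
  .setMaster("local[*]")
  .set("spark.executor.memory", "1g") // keep this below what the machine/cluster actually has
val sc = new SparkContext(localConf)

val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")
records.collect().foreach(println)
sc.stop()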
Specific to the comments above: if you started your cluster with only sbin/start-master.sh,
you will NOT get a working cluster. At the very minimum a standalone cluster needs a master and a worker, so use the sbin/start-all.sh
script instead. I recommend a bit more homework and following a tutorial.
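If you are unsure whether any worker registered, one rough check (a sketch I am adding here, assuming the standard SparkContext API) is to ask the context for its executor memory status right after creating it; the map always includes the driver itself, so a size of 1 means no executor joined and a collect will just wait in the queue:

// Rough check: list the block managers registered with this SparkContext.
// Only the driver in the list => no worker/executor has joined the application yet.
val executors = sc.getExecutorMemoryStatus
println(s"Registered block managers (including driver): ${executors.size}")
executors.foreach { case (host, (maxMem, remaining)) =>
  println(s"$host -> max ${maxMem / (1024 * 1024)} MB, free ${remaining / (1024 * 1024)} MB")
}

Keep in mind that executors can take a few seconds to register, so treat this as a hint rather than a definitive test.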
Upvotes: 2