sclee1

Reputation: 1281

How to debug Apache Spark when it gets stuck at a certain line?

I have a question about Apache Spark. My job gets stuck at a certain point, as shown below.

18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.33:53082 (size: 3.4 MB, free: 634.0 MB)
18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.36:46005 (size: 3.4 MB, free: 634.0 MB)
18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.36:41989 (size: 3.4 MB, free: 634.0 MB)
18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.35:43500 (size: 3.4 MB, free: 634.0 MB)
18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.35:47872 (size: 3.4 MB, free: 406.7 MB)
18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.35:34693 (size: 3.4 MB, free: 634.0 MB)
18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.36:38656 (size: 3.4 MB, free: 634.0 MB)
18/11/05 17:03:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.3.35:37369 (size: 3.4 MB, free: 634.0 MB)

It doesn't move on to the next step even as time passes. When I give it a small dataset, the procedure runs fine. However, when I give it a large dataset, it always gets stuck at the point above. I think it may be a memory issue, but I am not sure of the exact reason. In this case, how can I drill down and investigate why the progress is so slow?

I have attached the spark-submit script to help you understand the situation.

spark-submit \
        --class com.bistel.test.IMTestDataSet \
        --master spark://spark.dso.spkm1:7077 \
        --driver-cores 2 \
        --driver-memory 4g \
        --executor-memory 2500m \
        --num-executors 8 \
        --executor-cores 1 \
        /home/jumbo/user/sclee/dataset/jar/dataset.debug.1.jar \
        /user/sclee/dataset/parquet/cause/500000 /user/sclee/dataset/effect/

Upvotes: 3

Views: 6646

Answers (1)

Prashant

Reputation: 772

You have two options to explore from here. Given that very little information is provided about the code, setup, etc., I will take the liberty of assuming that the code is written in Scala and you are running Spark 2 or above.

In your Scala code you can add Log4j statements to log progress during Spark execution. The logs can then be collected from the cluster.
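A minimal sketch of what that might look like, assuming your entry point resembles the class named in your spark-submit command (the read/write calls here are hypothetical placeholders for your actual logic):

import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession

object IMTestDataSet {
  // @transient lazy val avoids serializing the logger into task closures
  @transient lazy val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IMTestDataSet").getOrCreate()

    log.info(s"Reading parquet input from ${args(0)}")
    val input = spark.read.parquet(args(0))

    // Hypothetical stand-in for your real transformations
    log.info(s"Starting write to ${args(1)}")
    input.write.parquet(args(1))
    log.info("Write finished")

    spark.stop()
  }
}

Grepping the driver and executor logs for these lines will tell you which step starts but never finishes.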

Since your execution is stuck, you should check the Spark Web UI and drill down from Job > Stages > Tasks to figure out what is causing things to get stuck.
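If clicking through the UI is awkward, the same job/stage breakdown is exposed over Spark's monitoring REST API. A sketch, assuming the driver runs on spark.dso.spkm1 with the UI on its default port 4040 (adjust the host to wherever your driver actually runs):

# List applications known to this driver
curl http://spark.dso.spkm1:4040/api/v1/applications

# Show per-stage status for one application (replace <app-id> with an id from the call above)
curl http://spark.dso.spkm1:4040/api/v1/applications/<app-id>/stages

A stage sitting in ACTIVE with tasks that never complete points you at the culprit.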

Some generic questions to ask are:

a. How many executors are running?
b. Is there a stage/task that is getting re-created after failure?
c. Is there memory contention?
d. Is garbage collection taking too long to finish? (see the spark-submit sketch after this list)
e. How much time is the job expected to take?
f. Does the server have enough CPU and memory?
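For question d, you can turn on GC logging on the executors via spark.executor.extraJavaOptions. Here is your original command with that one --conf line added (these are standard Java 8 GC flags; the output lands in each executor's stderr):

spark-submit \
        --class com.bistel.test.IMTestDataSet \
        --master spark://spark.dso.spkm1:7077 \
        --driver-cores 2 \
        --driver-memory 4g \
        --executor-memory 2500m \
        --num-executors 8 \
        --executor-cores 1 \
        --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
        /home/jumbo/user/sclee/dataset/jar/dataset.debug.1.jar \
        /user/sclee/dataset/parquet/cause/500000 /user/sclee/dataset/effect/

Long or back-to-back full GC pauses in that output would support your memory-pressure theory.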

I hope this helps to some extent.

Upvotes: 3
