Kundan Kumar

Reputation: 2002

org.apache.spark.shuffle.FetchFailedException

I am running this query on a dataset of 4 billion rows and getting an

org.apache.spark.shuffle.FetchFailedException error.

select adid,position,userid,price
from (
select adid,position,userid,price,
dense_rank() OVER (PARTITION BY adlocationid ORDER BY price DESC) as rank
FROM trainInfo) as tmp
WHERE rank <= 2

I have attached the error logs from the spark-sql terminal. Please suggest what the reason for this kind of error is and how I can resolve it.


Upvotes: 0

Views: 1840

Answers (1)

Iulian Dragos

Reputation: 5712

The problem is that you lost an executor:

15/08/25 10:08:13 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 165758 ms exceeds timeout 120000 ms
15/08/25 10:08:13 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.1.223: Executor heartbeat timed out after 165758 ms

The exception occurs when trying to read shuffle data from that node. The node may be stuck in a very long GC pause (try using a smaller heap size for executors), it may have hit a network failure, or it may simply have crashed. Normally Spark should recover from a lost node like this one, and indeed it starts resubmitting the first stage to another node. Depending on how big your cluster is, it may or may not succeed.
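As a possible mitigation (not part of the original answer), you could raise the timeouts Spark uses before declaring an executor dead, cap the executor heap to shorten GC pauses, and increase the shuffle partition count so each fetch is smaller. A minimal sketch; the flag values and the `query.sql` file name are illustrative, so tune them for your cluster:

```shell
# spark.network.timeout: how long Spark waits before giving up on an executor
#   (default 120s, which matches the "exceeds timeout 120000 ms" in the log).
# spark.executor.heartbeatInterval: must stay well below spark.network.timeout.
# spark.executor.memory: a smaller heap tends to shorten GC pauses.
# spark.sql.shuffle.partitions: more partitions -> smaller shuffle blocks per fetch.
spark-sql \
  --conf spark.network.timeout=300s \
  --conf spark.executor.heartbeatInterval=30s \
  --conf spark.executor.memory=4g \
  --conf spark.sql.shuffle.partitions=400 \
  -f query.sql
```

These settings only buy headroom; if one node is consistently pausing or dropping off the network, checking its GC logs and connectivity is the more direct fix.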

Upvotes: 2
