Reputation: 2002
I am running this query on a dataset of 4 billion rows and getting an
org.apache.spark.shuffle.FetchFailedException.
SELECT adid, position, userid, price
FROM (
  SELECT adid, position, userid, price,
         dense_rank() OVER (PARTITION BY adlocationid ORDER BY price DESC) AS rank
  FROM trainInfo
) AS tmp
WHERE rank <= 2
I have attached the error logs from the spark-sql terminal. Please suggest what causes this kind of error and how I can resolve it.
Upvotes: 0
Views: 1840
Reputation: 5712
The problem is that you lost an executor:
15/08/25 10:08:13 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 165758 ms exceeds timeout 120000 ms
15/08/25 10:08:13 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.1.223: Executor heartbeat timed out after 165758 ms
The exception occurs when Spark tries to read shuffle data from that node. The node may be stuck in a very long GC pause (consider trying a smaller heap size for the executors), there may be a network failure, or the process may simply have crashed. Normally Spark should recover from a lost node like this one, and indeed it starts resubmitting the first stage to another node. Depending on how big your cluster is, that retry may or may not succeed.
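As a rough sketch of what tuning that looks like (the values below are illustrative, not tuned for your cluster, and query.sql is a placeholder for your script), you can pass these settings when launching spark-sql:

# Illustrative values only:
# - a smaller executor heap keeps full-GC pauses shorter
# - a larger network timeout tolerates occasional long pauses; the default
#   of 120s matches the "timeout 120000 ms" in the log above
spark-sql \
  --conf spark.executor.memory=4g \
  --conf spark.network.timeout=300s \
  -f query.sql

If the executors keep timing out even with a smaller heap, raising spark.network.timeout buys them time to respond; if they are genuinely crashing, check the executor logs on 192.168.1.223 for the actual cause.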
Upvotes: 2