Reputation: 143
I have opened an AWS EMR cluster, and in a PySpark3 Jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?
Upvotes: 23
Views: 23629
Reputation: 21
Just a restart helped solve this problem for me. In your Jupyter notebook, go to Kernel -> Restart. Once done, if you run a cell with the spark command, you will see that a new Spark session gets established.
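As a quick sanity check after the restart, the first cell that references the session will start a new one; a minimal sketch (the applicationId check is only an illustrative way to see that a fresh application was started):
spark
spark.sparkContext.applicationId  # a new application id indicates a fresh session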
Upvotes: 2
Reputation: 2816
What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.
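For reference, that snippet is one entry in the cluster-level configurations list supplied when the cluster is created (for example under the EMR console's software settings); a minimal sketch, with 20G taken from this answer and to be sized to your own instances:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "20G"
    }
  }
]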
Upvotes: 0
Reputation: 1711
From this Stack Overflow question's answer, which worked for me:
Judging by the output, if your application is not finishing with a FAILED status, this sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. edit the /etc/livy/conf/livy.conf file (on the cluster's master node)
2. set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
3. restart Livy to pick up the setting: sudo restart livy-server on the cluster's master node
4. test your code again
An alternative way to edit this setting at cluster creation time (sketched below): https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
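A sketch of that creation-time alternative, assuming the EMR livy-conf configuration classification is available on your release (the 8h value is the one from the steps above):
[
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.server.session.timeout": "8h"
    }
  }
]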
Upvotes: 9
Reputation: 1544
I had the same issue, and the reason for the timeout is the driver running out of memory. Since you run collect(), all the data gets sent to the driver. By default the driver memory is 1000M when creating a Spark application through JupyterHub, even if you set a higher value through config.json. You can see that by executing this code from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory, just do:
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
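Once the application has restarted, the same check as above can confirm the new value took effect; a minimal follow-up:
spark.sparkContext.getConf().get('spark.driver.memory')  # should now report the value passed to %%configure, e.g. 6000M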
Upvotes: 22
Reputation: 55
Insufficient reputation to comment.
I tried increasing the heartbeat interval to a much higher value (100 seconds); still the same result. FWIW, the error shows up in under 9 seconds.
Upvotes: 0
Reputation: 2865
You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
The solution might be to increase spark.executor.heartbeatInterval. The default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
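As an alternative to baking it into spark-defaults.conf, the same property can also be set per notebook session through the %%configure magic used in another answer here; a minimal sketch, with 60s as an assumed (not prescribed) value:
%%configure -f
{"conf": {"spark.executor.heartbeatInterval": "60s"}}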
Upvotes: 0