Rajesh P

Reputation: 41

Python worker failed to connect back in PySpark or Spark version 2.3.1

After installing Anaconda3 and Spark (2.3.2), I'm trying to run the sample PySpark code below.

This is just a sample program I'm running through Jupyter, and I'm getting an error like:

Python worker failed to connect back.

As per the below question on Stack Overflow:

Python worker failed to connect back

I can see a suggested solution like this: "I got the same error. I solved it by installing the previous version of Spark (2.3 instead of 2.4). Now it works perfectly; maybe it is an issue of the latest version of pyspark."

But I'm using Spark version 2.3.1 and Python version 3.7, and I'm still facing this issue. Please help me solve this error.

from pyspark.sql import SparkSession

# Start a local Spark session and run a trivial RDD action.
spark = SparkSession.builder.appName("mySparkApp").getOrCreate()
testData = spark.sparkContext.parallelize([3, 8, 2, 5])
testData.count()

The traceback is:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 6, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

Upvotes: 4

Views: 10327

Answers (3)

Asif Raza

Reputation: 1021

Please make sure you have properly set the environment variables.

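As a quick check, here is a minimal sketch (my addition, not part of the original answer) that prints the environment variables a local Spark/Anaconda setup typically relies on; the exact set of names, such as SPARK_HOME and HADOOP_HOME, is an assumption:

import os

# Print the variables Spark commonly depends on, so a missing or
# misconfigured one is easy to spot from inside the notebook.
for name in ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
    print(name, "=", os.environ.get(name, "<not set>"))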

Upvotes: 1

Ramineni Ravi Teja

Reputation: 3936

Just add the environment variable PYSPARK_PYTHON with the value python. It solves the issue; there is no need to upgrade or downgrade the Spark version. It worked for me.

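For example, a minimal sketch of this fix applied from inside the notebook, before the SparkSession is created (using sys.executable rather than the literal value python is my assumption; it simply pins the workers to the same interpreter as the driver):

import os
import sys

# Point the Spark workers at the same interpreter that runs this notebook.
# This must be set before the SparkSession (and its JVM) is started.
os.environ["PYSPARK_PYTHON"] = sys.executable

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("mySparkApp").getOrCreate()
spark.sparkContext.parallelize([3, 8, 2, 5]).count()  # should now return 4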

Upvotes: 15

Henrique Branco

Reputation: 1940

Set your environment variables as follows:

  • PYSPARK_DRIVER_PYTHON=jupyter
  • PYSPARK_DRIVER_PYTHON_OPTS=notebook
  • PYSPARK_PYTHON=python

The heart of the problem is the connection between PySpark and Python, which is solved by setting these variables, as sketched below.
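For reference, a sketch of the same three settings expressed in Python (an assumption on my part; the answer sets them at the OS level, and the two driver-side variables only take effect when Spark is started through the pyspark launcher script):

import os

# These two only matter when Spark is started via the `pyspark` launcher:
# it will then open a Jupyter notebook as the driver front end.
os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"

# This one tells Spark which Python executable to start as the workers.
os.environ["PYSPARK_PYTHON"] = "python"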

Upvotes: 5
