Reputation: 333

Python worker failed to connect back

I'm trying to complete this Spark tutorial.

After installing Spark on my local machine (Windows 10 64-bit, Python 3, Spark 2.4.0) and setting all the environment variables (HADOOP_HOME, SPARK_HOME, etc.), I'm trying to run a simple WordCount.py Spark application:

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAppName("word count").setMaster("local[2]")
    sc = SparkContext(conf = conf)

    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()

    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))

After running it from the command line:

spark-submit WordCount.py

I get the error below. By commenting out the script line by line, I found that it crashes at

wordCounts = words.countByValue()

Any idea what I should check to make it work?

Traceback (most recent call last):
  File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module>
ModuleNotFoundError: No module named 'resource'
18/11/10 23:16:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
        at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
        at java.net.PlainSocketImpl.accept(Unknown Source)
        at java.net.ServerSocket.implAccept(Unknown Source)
        at java.net.ServerSocket.accept(Unknown Source)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
        ... 14 more
18/11/10 23:16:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/rdd/WordCount.py", line 19, in <module>
    wordCounts = words.countByValue()
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1261, in countByValue
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 844, in reduce
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 816, in collect
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
        at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
        at java.net.PlainSocketImpl.accept(Unknown Source)
        at java.net.ServerSocket.implAccept(Unknown Source)
        at java.net.ServerSocket.accept(Unknown Source)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
        ... 14 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
        at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
        at java.net.PlainSocketImpl.accept(Unknown Source)
        at java.net.ServerSocket.implAccept(Unknown Source)
        at java.net.ServerSocket.accept(Unknown Source)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
        ... 14 more

As suggested by theplatypus, I checked whether the 'resource' module can be imported directly from the terminal. Apparently it cannot:

>>> import resource
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'resource'

As for the installation, I followed the instructions from this tutorial:

downloaded spark-2.4.0-bin-hadoop2.7.tgz from Apache Spark website
un-zipped it to my C-drive
already had Python_3 installed (Anaconda distribution) as well as Java
created local 'C:\hadoop\bin' folder to store winutils.exe
created 'C:\tmp\hive' folder and gave Spark access to it
added environment variables (SPARK_HOME, HADOOP_HOME etc)

Is there any extra resource I should install?

Upvotes: 29

Answers (13)

SadanandM

Reputation: 59

Set the environment variable PYSPARK_PYTHON=python to fix it.

Upvotes: 1

Hasintha Abeykoon

Reputation: 524

Answer:

You need to set the environment variable below.

PYSPARK_PYTHON=%PYTHON_PATH% (if you've set PYTHON_PATH)

PYSPARK_PYTHON=<your_python_installation>

Explanation:

I had the same problem as above, with a message along the lines of: Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.

From the other answers and some reading through the docs, I figured out that I needed to add another environment variable so that Spark can find the Python executable. Apparently Spark reads its own environment variable to decide which Python to run. More info here.

Note: As far as I understand, PYSPARK_PYTHON is the only one required, assuming Java, Python and Hadoop are already on the PATH.
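As a rough sketch, the same variable can also be set from inside the script instead of system-wide. This assumes the interpreter running the driver is also the one the workers should use; `point_pyspark_at_current_python` is a hypothetical helper name, not part of PySpark.

```python
import os
import sys


def point_pyspark_at_current_python(env=None):
    """Set PYSPARK_PYTHON to the interpreter running this script.

    Hypothetical helper, not part of PySpark. An existing value is kept,
    so a system-wide setting still wins.
    """
    env = os.environ if env is None else env
    env.setdefault("PYSPARK_PYTHON", sys.executable)
    return env["PYSPARK_PYTHON"]


# Call this before the SparkContext or SparkSession is created, so the
# value is already in place when Spark starts its Python workers.
print(point_pyspark_at_current_python())
```

Using the full path from `sys.executable` rather than the bare name `python` avoids depending on what `python` happens to resolve to on the PATH.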

My machine specs and the versions I used:

Edition: Windows 11 Home Single Language
Version: 23H2
Installed on: 2022-10-10
OS build: 22631.3296
Experience: Windows Feature Experience Pack 1000.22687.1000.0

Python: 3.11
Spark: 3.5.1
Java: 1.8 (8)
winutils: 3.3.5 (download from here)

Some background information

(Not related to this issue, but someone out there might need it.) Initially I also had problems initializing the SparkContext, because I didn't have all the necessary winutils files in place. You need to download the entire directory for the specific winutils version. At first I only had the winutils.exe file, without the Hadoop cmd and the other binaries. I used the latest version available at the time (3.3.5). Download it, set the HADOOP_HOME environment variable to that folder, and add %HADOOP_HOME%/bin to the PATH. Then you are good to go.

There are also great step-by-step articles on setting up Spark on Windows if you are new to this.
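A small sketch for checking that the winutils folder is complete. The function name `missing_hadoop_binaries` and the default list of required files are assumptions made for illustration, based only on the files mentioned above; the exact set you need depends on your winutils build.

```python
import os
import tempfile


def missing_hadoop_binaries(hadoop_home, required=("winutils.exe", "hadoop.cmd")):
    """Return the names from `required` that are not in <hadoop_home>/bin.

    Hypothetical helper; the default `required` tuple is an assumption.
    """
    bin_dir = os.path.join(hadoop_home, "bin")
    return [name for name in required
            if not os.path.isfile(os.path.join(bin_dir, name))]


# Demonstration on a throwaway folder that contains only winutils.exe.
with tempfile.TemporaryDirectory() as demo_home:
    os.mkdir(os.path.join(demo_home, "bin"))
    open(os.path.join(demo_home, "bin", "winutils.exe"), "w").close()
    print(missing_hadoop_binaries(demo_home))  # ['hadoop.cmd']
```

On a real setup you would pass the value of HADOOP_HOME, for example `missing_hadoop_binaries(os.environ["HADOOP_HOME"])`.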

Upvotes: 0

Rohan KUMAR

Reputation: 1

In my case, running these commands before initializing the session solved it:

import findspark

findspark.init("path/to/spark")

Upvotes: 0

Kelum Sampath Edirisinghe

Reputation: 1268

This error occurs because the connection between PySpark and Python is not established correctly.

Solution: set the PYSPARK_PYTHON environment variable to python.

Note: make sure to restart your cmd or shell after adding environment variables.

Upvotes: 0

Carlos Perez Haurie

Reputation: 11

I had to set:

HADOOP_HOME = [..]\winutils\hadoop-3.0.0
PATH = [..]\winutils\hadoop-3.0.0\bin
PYSPARK_PYTHON = python

You can find winutils here. I use Hadoop 3.0.0 and PySpark 3.2.1.

Upvotes: 1

Nikhil Agarwal

Reputation: 191

I had the same issue. I had set all the environment variables correctly but still wasn't able to resolve it.

In my case,

import findspark
findspark.init()

adding this before even creating the SparkSession helped.

I was using Visual Studio Code on Windows 10 with Spark 3.2.0 and Python 3.9.

Note: First check that the paths for HADOOP_HOME, SPARK_HOME and PYSPARK_PYTHON have been set correctly.
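One way to do that check is to print what the running Python process actually sees, which also catches the case where a shell was not restarted after the variables were added. This is only a sketch; `report_spark_env` is a hypothetical helper name.

```python
import os

# The three variables named in the note above.
SPARK_ENV_VARS = ("HADOOP_HOME", "SPARK_HOME", "PYSPARK_PYTHON")


def report_spark_env(env=None):
    """Map each variable name to its value, or None if it is not set."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in SPARK_ENV_VARS}


for name, value in report_spark_env().items():
    print("{} = {}".format(name, value if value else "<not set>"))
```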

Upvotes: 19

Laenka-Oss

Reputation: 994

There seem to be many possible causes for this error. I still had the same problem even though my environment variables were all set correctly.

In my case adding this

import findspark
findspark.init()

solved the problem.

I'm using Jupyter Notebook on Windows 10 with Spark 3.1.2 and Python 3.6.

Upvotes: 0

Ashvin Anto

Reputation: 11

When you run the Python installer, in the Customize Python section, make sure that the option "Add python.exe to Path" is selected. If this option is not selected, some of the PySpark utilities such as pyspark and spark-submit might not work. This worked for me! Happy Sharing :)
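To see what the name python resolves to on the PATH, a quick sketch using the standard library's `shutil.which` can help. The wrapper name `resolve_interpreter` is hypothetical.

```python
import shutil
import sys


def resolve_interpreter(command="python"):
    """Return the full path that `command` resolves to on PATH, or None."""
    return shutil.which(command)


print("python on PATH:", resolve_interpreter("python"))
print("running interpreter:", sys.executable)
```

If the first line prints None, the installer option above was probably not selected. On Windows, a path inside a WindowsApps folder may be the Microsoft Store alias rather than a real installation.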

Upvotes: 1

帝国阿三

Reputation: 349

Set the environment variable PYSPARK_PYTHON=python to fix it.

Upvotes: 34

Henrique Branco

Reputation: 1940

The heart of the problem is the connection between PySpark and Python, which is fixed by redefining the environment variables.

I just changed the values of two environment variables: PYSPARK_DRIVER_PYTHON from ipython to jupyter, and PYSPARK_PYTHON from python3 to python.

I'm now using Jupyter Notebook, Python 3.7, Java JDK 11.0.6 and Spark 2.4.2.

Upvotes: 16

Erkan Şirin

Reputation: 2095

Downgrading Spark from 2.4.0 back to 2.3.2 was not enough for me. I don't know why, but in my case I had to create the SparkContext from a SparkSession, like this:

sc = spark.sparkContext

Then the very same error disappeared.

Upvotes: 3

Raf

Reputation: 216

I got the same error. I solved it by installing the previous version of Spark (2.3 instead of 2.4). Now it works perfectly, so it may be an issue with the latest version of pyspark.

Upvotes: 19

theplatypus

Reputation: 101

Looking at the source of the error (worker.py#L25), it seems that the Python interpreter used to instantiate a pyspark worker doesn't have access to the resource module, a built-in module referred to in Python's docs as part of "Unix Specific Services".

Are you sure you can run pyspark on Windows (at least without some additional software like GOW or MinGW), and that you didn't skip some Windows-specific installation steps?

Could you open a Python console (the one used by pyspark) and see if you can >>> import resource without getting the same ModuleNotFoundError? If not, could you provide the resources you used to install it on Windows 10?
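The same check can be run as a small script that does not raise on failure. This is a sketch only; `module_available` is a hypothetical helper built on the standard library's `importlib.util.find_spec`.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


print("platform:", sys.platform)
print("resource importable:", module_available("resource"))
```

Running it with the interpreter that pyspark launches shows whether that interpreter is on a platform where the resource module exists.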

Upvotes: 1
