Reputation: 601
I have been trying to install Spark (pyspark) on my Windows 10 machine for two weeks now, and I have realized that I need your help.
When I try to start 'pyspark' in the command prompt, I still receive the following error:
'pyspark' is not recognized as an internal or external command, operable program or batch file.
To me this hints at a problem with the PATH/environment variables, but I cannot find the root of the problem.
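As a quick way to check that hypothesis (this only uses the built-in Windows command-line tools), the following shows whether cmd can resolve pyspark at all and what is currently on the PATH:
where pyspark
echo %PATH%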
I have tried multiple tutorials but the best I found was the one by Michael Galarnyk. I followed his tutorial step by step:
Downloaded Spark 2.3.1 from the official website (I changed the commands accordingly, since Michael's tutorial uses a different version). In line with the tutorial, I moved it in the cmd prompt:
mv C:\Users\patri\Downloads\spark-2.3.1-bin-hadoop2.7.tgz C:\opt\spark\spark-2.3.1-bin-hadoop2.7.tgz
Then I untarred it:
gzip -d spark-2.3.1-bin-hadoop2.7.tgz
and
tar xvf spark-2.3.1-bin-hadoop2.7.tar
Downloaded winutils.exe for Hadoop 2.7.1 from GitHub:
curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe?raw=true
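Since HADOOP_HOME below points at the Spark folder, winutils.exe also has to end up in that folder's bin directory. Assuming curl left winutils.exe in the current directory, that step would look like this:
REM assumes winutils.exe was downloaded to the current directory
mv winutils.exe C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin\winutils.exe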
Set my environment variables accordingly:
setx SPARK_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
setx HADOOP_HOME C:\opt\spark\spark-2.3.1-bin-hadoop2.7
setx PYSPARK_DRIVER_PYTHON jupyter
setx PYSPARK_DRIVER_PYTHON_OPTS notebook
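One caveat with setx that may matter here: it only affects processes started afterwards, so the new values are not visible in the command prompt that ran these commands. Opening a fresh cmd window and echoing the variables confirms that they were stored:
echo %SPARK_HOME%
echo %HADOOP_HOME%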
Then I added C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin to my PATH variable. My user environment variables now look like this: [screenshot: current environment variables]
These actions should have done the trick, but when I run
pyspark --master local[2]
I still get the error from above. Can you help me track down this problem using the information above?
I ran a couple of checks in the command prompt to verify the setup.
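For example, checks along these lines (illustrative, not necessarily the exact commands used) confirm that the extracted folder exists, that the pyspark launcher is in its bin directory, and that the bin directory made it into the active PATH:
REM illustrative checks only
dir C:\opt\spark\spark-2.3.1-bin-hadoop2.7\bin\pyspark*
echo %PATH% | findstr /i "spark"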
Upvotes: 7
Views: 21021
Reputation: 31
Following the steps explained in my blog post will resolve your problem:
How to Setup PySpark on Windows https://beasparky.blogspot.com/2020/05/how-to-setup-pyspark-in-windows.html
To set up the environment paths for Spark, go to "Advanced System Settings" and set the paths below:
JAVA_HOME="C:\Program Files\Java\jdk1.8.0_181"
HADOOP_HOME="C:\spark-2.4.0-bin-hadoop2.7"
SPARK_HOME="C:\spark-2.4.0-bin-hadoop2.7"
Also, add their bin paths to the PATH system variable.
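Equivalently, assuming the same install locations shown above, the first three can be set from a command prompt with setx (by default setx writes user variables; add /M from an administrator prompt for system variables, and note the values only appear in newly opened windows). The two bin folders (%SPARK_HOME%\bin and %JAVA_HOME%\bin) are easiest to add through the same Environment Variables dialog, since setx truncates values longer than 1024 characters:
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_181"
setx HADOOP_HOME "C:\spark-2.4.0-bin-hadoop2.7"
setx SPARK_HOME "C:\spark-2.4.0-bin-hadoop2.7"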
Upvotes: 2
Reputation: 314
I resolved this issue by setting the variables as "system variables" rather than "user variables". Note that running
pyspark --master local[2]
should then work (make sure winutils.exe is there); if that does not work, then you have other issues than just the environment variables.
Upvotes: 4