Reputation: 1973
I have installed PySpark with Python 3.6 and I am using a Jupyter notebook to initialize a Spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
which runs without any errors.
But when I write
df = spark.range(10)
df.show()
it throws this error:
Py4JError: An error occurred while calling o54.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
I don't know why I am facing this issue.
If I do,
from pyspark import SparkContext
sc = SparkContext()
print(sc.version)
'2.1.0'
Upvotes: 19
Views: 70760
Reputation: 11
Try changing the PySpark version. It worked for me: I was using 3.2.1 and getting this error; after switching to 3.2.2 it worked perfectly fine.
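As a quick check (just a sketch; the version numbers here are examples, not prescriptions), compare the installed pyspark package with the Spark runtime it talks to:
import pyspark
print(pyspark.__version__)            # version of the pip-installed pyspark package
# compare with the output of `spark-submit --version`; if they differ, reinstall, e.g.
# !pip install pyspark==3.2.2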
Upvotes: 1
Reputation: 1
I had the same error when using PyCharm and executing code in the Python Console on Windows 10; however, I was able to run the same code without error when launching pyspark from the terminal. After trying solutions from many searches, the fix for the PyCharm Python Console error was a combination of all of the environment-variable steps (I set them up for both User and System) and the PyCharm settings steps in the following two blog posts: setup pyspark locally and spark & pycharm.
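For illustration only (a rough sketch with placeholder paths, not the exact values from those posts), the same variables can also be set at the top of the Python Console session before pyspark is imported:
import os
# placeholder paths -- replace with your own Spark and Python locations
os.environ["SPARK_HOME"] = r"C:\spark\spark-2.4.4-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\spark\spark-2.4.4-bin-hadoop2.7"
os.environ["PYSPARK_PYTHON"] = r"C:\Python36\python.exe"
import findspark          # optional helper that fixes up sys.path from SPARK_HOME
findspark.init()
import pyspark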
Upvotes: 0
Reputation: 11
!pip install findspark
!pip install pyspark==2.4.4
# point PYTHONPATH at Spark's Python bindings and its bundled Py4J source zip
%env PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%
import findspark
findspark.init()                       # must run before importing pyspark
import pyspark
from pyspark import SparkConf, SparkContext
sc = pyspark.SparkContext.getOrCreate()
You have to add the paths and install the necessary libraries for Apache Spark.
Upvotes: 1
Reputation: 11
Here are the steps and the combination of tools that worked for me using Jupyter:
1) Install Java 1.8
2) Set the JAVA_HOME environment variable for Java, e.g. JAVA_HOME = C:\Program Files\Java\javasdk_1.8.241
3) Install PySpark 2.7 using conda install (3.0 did not work for me; it gave an error asking me to match the PySpark and Spark versions). Search for the conda install command for PySpark 2.7.
4) Install Spark 2.4 (3.0 did not work for me)
5) Set the SPARK_HOME environment variable to the Spark download folder, e.g. SPARK_HOME = C:\Users\Spark
6) Set the HADOOP_HOME environment variable to the Spark download folder, e.g. HADOOP_HOME = C:\Users\Spark
7) Download winutils.exe and place it inside the bin folder of the Spark download folder after unzipping Spark.tgz
8) Install FindSpark in conda; search for it on the Anaconda.org website and install it in the Jupyter notebook (this was one of the most important steps to avoid getting an error)
9) Restart the computer to make sure the environment variables are applied
10) You can validate that the environment variables are applied by typing the following in a Windows command prompt:
C:\> echo %SPARK_HOME%
This should show you the environment variable that you added in Advanced Settings for Windows 10
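After these steps, a quick sanity check in a Jupyter cell could look like this (a sketch assuming the environment variables above are in place):
import findspark
findspark.init()                      # picks up SPARK_HOME from the environment
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("smoke-test").getOrCreate()
print(spark.version)
spark.range(10).show()                # the call that failed in the question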
Upvotes: 1
Reputation: 41
import findspark
findspark.init("<path of Spark or Hadoop>")
from pyspark import SparkContext
You need to call findspark.init() first, and then you can import pyspark.
Upvotes: 0
Reputation: 1401
I had a similar "Constructor [...] does not exist" problem. Then I found that the version of the PySpark package was not the same as the Spark version (2.4.4) installed on the server. Finally, I solved the problem by reinstalling PySpark with the same version:
pip install pyspark==2.4.4
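To confirm the two versions now line up (a small sketch; adjust 2.4.4 to whatever your server runs):
import pyspark
print(pyspark.__version__)    # version of the pip-installed package, e.g. 2.4.4
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.version)          # version of the Spark the driver actually runs; should match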
Upvotes: 3
Reputation: 1684
For me
import findspark
findspark.init()
import pyspark
solved the problem
Upvotes: 11
Reputation: 1656
If you are using PySpark in Anaconda, add the code below to set SPARK_HOME before running your code:
import os
import sys
spark_path = r"spark-2.3.2-bin-hadoop2.7" # spark installed folder
os.environ['SPARK_HOME'] = spark_path
sys.path.insert(0, spark_path + "/bin")
sys.path.insert(0, spark_path + "/python/pyspark/")
sys.path.insert(0, spark_path + "/python/lib/pyspark.zip")
sys.path.insert(0, spark_path + "/python/lib/py4j-0.10.7-src.zip")
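With those paths in place, a session should come up cleanly; for example (a minimal check, assuming spark-2.3.2-bin-hadoop2.7 really is your install folder):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
spark.range(10).show()   # should print rows 0..9 instead of raising Py4JError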
Upvotes: 4
Reputation: 120
I just needed to set the SPARK_HOME environment variable to the location of Spark. I added the following lines to my ~/.bashrc file.
# SPARK_HOME
export SPARK_HOME="/home/pyuser/anaconda3/lib/python3.6/site-packages/pyspark/"
Since I am using different versions of Spark in different environments, I followed this tutorial (link) to create environment variables for each conda environment.
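To find the corresponding path inside any given conda environment (a small helper sketch; it simply locates the installed pyspark package, which is what SPARK_HOME points at above):
import os
import pyspark
# directory of the pip/conda-installed pyspark package in the active environment
print(os.path.dirname(pyspark.__file__))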
Upvotes: 3
Reputation: 597
I am happy now because I have been having exactly the same issue with my PySpark and I finally found the solution. In my case, I am running on Windows 10. After many searches via Google, I found the correct way of setting the required environment variables:
PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip
The version of the Py4J source package changes between Spark versions, so check what you have in your Spark installation and change the placeholder accordingly.
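One way to see exactly which Py4J source zip your Spark ships with (a small sketch, assuming SPARK_HOME is already set):
import glob
import os
# prints something like ...\python\lib\py4j-0.10.7-src.zip; use that exact name in PYTHONPATH
print(glob.glob(os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-*-src.zip")))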
For a complete reference to the process look at this site: how to install spark locally
Upvotes: 17
Reputation: 15283
I think spark.range is supposed to return an RDD object. Therefore, show is not a method you can use. Please use collect or take instead.
You can also replace spark.range with sc.range if you want to use show.
Upvotes: 0