Reputation: 1973
I have installed PySpark with Python 3.6 and I am using a Jupyter notebook to initialize a Spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
which runs without any errors.
But when I write
df = spark.range(10)
df.show()
it throws this error:
Py4JError: An error occurred while calling o54.showString. Trace:
py4j.Py4JException: Method showString([class java.lang.Integer, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
I don't know why I am facing this issue.
If I do,
from pyspark import SparkContext
sc = SparkContext()
print(sc.version)
'2.1.0'
Upvotes: 19
Views: 70760
Reputation: 11
Try changing the PySpark version. It worked for me: I was using 3.2.1 and getting this error; after switching to 3.2.2 it worked perfectly fine.
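As a quick check (just a sketch; the version numbers here are examples, not prescriptions), compare the installed pyspark package with the Spark runtime it talks to:
import pyspark
print(pyspark.__version__)            # version of the pip-installed pyspark package
# compare with the output of `spark-submit --version`; if they differ, reinstall, e.g.
# !pip install pyspark==3.2.2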
Upvotes: 1
Reputation: 1
I had the same error when using PyCharm and executing code in the Python Console on Windows 10; however, I was able to run the same code without error when launching pyspark from the terminal. After trying solutions from many searches, the fix for the PyCharm Python Console error was a combination of all of the environment-variable steps (I set them up for both User and System) and the PyCharm settings steps in the following two blog posts: setup pyspark locally and spark & pycharm.
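For illustration only (a rough sketch with placeholder paths, not the exact values from those posts), the same variables can also be set at the top of the Python Console session before pyspark is imported:
import os
# placeholder paths -- replace with your own Spark and Python locations
os.environ["SPARK_HOME"] = r"C:\spark\spark-2.4.4-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\spark\spark-2.4.4-bin-hadoop2.7"
os.environ["PYSPARK_PYTHON"] = r"C:\Python36\python.exe"
import findspark          # optional helper that fixes up sys.path from SPARK_HOME
findspark.init()
import pyspark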
Upvotes: 0
Reputation: 11
!pip install findspark
!pip install pyspark==2.4.4
# point PYTHONPATH at Spark's Python bindings and its bundled Py4J source zip
%env PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip;%PYTHONPATH%
import findspark
findspark.init()                       # must run before importing pyspark
import pyspark
from pyspark import SparkConf, SparkContext
sc = pyspark.SparkContext.getOrCreate()
You have to add the paths and install the necessary libraries for Apache Spark.
Upvotes: 1
Reputation: 11
Here are the steps and the combination of tools that worked for me using Jupyter:
1) Install Java 1.8
2) Set the JAVA_HOME environment variable for Java, e.g. JAVA_HOME = C:\Program Files\Java\javasdk_1.8.241
3) Install PySpark 2.7 using conda install (3.0 did not work for me; it gave an error asking me to match the PySpark and Spark versions). Search for the conda install command for PySpark 2.7.
4) Install Spark 2.4 (3.0 did not work for me)
5) Set the SPARK_HOME environment variable to the Spark download folder, e.g. SPARK_HOME = C:\Users\Spark
6) Set the HADOOP_HOME environment variable to the Spark download folder, e.g. HADOOP_HOME = C:\Users\Spark
7) Download winutils.exe and place it inside the bin folder of the Spark download folder after unzipping Spark.tgz
8) Install FindSpark in conda; search for it on the Anaconda.org website and install it in the Jupyter notebook (this was one of the most important steps to avoid getting an error)
9) Restart the computer to make sure the environment variables are applied
10) You can validate that the environment variables are applied by typing the following in a Windows command prompt:
C:\> echo %SPARK_HOME%
This should show you the environment variable that you added in Advanced Settings for Windows 10
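After these steps, a quick sanity check in a Jupyter cell could look like this (a sketch assuming the environment variables above are in place):
import findspark
findspark.init()                      # picks up SPARK_HOME from the environment
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("smoke-test").getOrCreate()
print(spark.version)
spark.range(10).show()                # the call that failed in the question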
Upvotes: 1
Reputation: 41
import findspark
findspark.init("<path of Spark or Hadoop>")
from pyspark import SparkContext
You need to call findspark.init() first, and then you can import pyspark.
Upvotes: 0
Reputation: 1401
I had a similar "Constructor [...] does not exist" problem. Then I found that the version of the PySpark package was not the same as the Spark version (2.4.4) installed on the server. Finally, I solved the problem by reinstalling PySpark with the same version:
pip install pyspark==2.4.4
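To confirm the two versions now line up (a small sketch; adjust 2.4.4 to whatever your server runs):
import pyspark
print(pyspark.__version__)    # version of the pip-installed package, e.g. 2.4.4
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.version)          # version of the Spark the driver actually runs; should match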
Upvotes: 3
Reputation: 1684
For me
import findspark
findspark.init()
import pyspark
solved the problem
Upvotes: 11
Reputation: 1656
If you are using PySpark in Anaconda, add the code below to set SPARK_HOME before running your code:
import os
import sys
spark_path = r"spark-2.3.2-bin-hadoop2.7" # spark installed folder
os.environ['SPARK_HOME'] = spark_path
sys.path.insert(0, spark_path + "/bin")
sys.path.insert(0, spark_path + "/python/pyspark/")
sys.path.insert(0, spark_path + "/python/lib/pyspark.zip")
sys.path.insert(0, spark_path + "/python/lib/py4j-0.10.7-src.zip")
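With those paths in place, a session should come up cleanly; for example (a minimal check, assuming spark-2.3.2-bin-hadoop2.7 really is your install folder):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
spark.range(10).show()   # should print rows 0..9 instead of raising Py4JError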
Upvotes: 4
Reputation: 120
I just needed to set the SPARK_HOME environment variable to the location of Spark. I added the following lines to my ~/.bashrc file.
# SPARK_HOME
export SPARK_HOME="/home/pyuser/anaconda3/lib/python3.6/site-packages/pyspark/"
Since I am using different versions of Spark in different environments, I followed this tutorial (link) to create environment variables for each conda environment.
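To find the corresponding path inside any given conda environment (a small helper sketch; it simply locates the installed pyspark package, which is what SPARK_HOME points at above):
import os
import pyspark
# directory of the pip/conda-installed pyspark package in the active environment
print(os.path.dirname(pyspark.__file__))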
Upvotes: 3
Reputation: 597
I am happy now because I have been having exactly the same issue with my PySpark and I finally found the solution. In my case, I am running on Windows 10. After many searches via Google, I found the correct way of setting the required environment variables:
PYTHONPATH=%SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-<version>-src.zip
The version of the Py4J source package changes between Spark versions, so check what you have in your Spark installation and change the placeholder accordingly.
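One way to see exactly which Py4J source zip your Spark ships with (a small sketch, assuming SPARK_HOME is already set):
import glob
import os
# prints something like ...\python\lib\py4j-0.10.7-src.zip; use that exact name in PYTHONPATH
print(glob.glob(os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-*-src.zip")))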
For a complete reference to the process look at this site: how to install spark locally
Upvotes: 17
Reputation: 15283
I think spark.range is supposed to return an RDD object. Therefore, show is not a method you can use. Please use collect or take instead.
You can also replace spark.range with sc.range if you want to use show.
Upvotes: 0