Reputation: 384

pyspark ImportError: cannot import name accumulators

Goal: I am trying to get Apache Spark's pyspark to be interpreted correctly within my PyCharm IDE.

Problem: I currently receive the following error:

ImportError: cannot import name accumulators

I was following this blog post to help me through the process: http://renien.github.io/blog/accessing-pyspark-pycharm/

Because my code was taking the except path, I removed the try/except to see what the exact error was.

Prior to this I received the following error:

ImportError: No module named py4j.java_gateway

This was fixed simply by running 'sudo pip install py4j' in bash.

My code currently looks like the following chunk:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="[MY_HOME_DIR]/spark-1.2.0"

# Append pyspark to Python Path
sys.path.append("[MY_HOME_DIR]/spark-1.2.0/python/")

try:
    from pyspark import SparkContext
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
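
If the goal is only to see the exact error without deleting the try/except, one option is to print the full traceback. This is a minimal sketch, assuming only Python's standard importlib and traceback modules; try_import is a hypothetical helper name, not part of pyspark:

```python
import importlib
import traceback


def try_import(module_name):
    """Try to import module_name and return (ok, details)."""
    try:
        importlib.import_module(module_name)
        return True, "imported " + module_name
    except ImportError:
        # format_exc() keeps the file and line where the import actually failed
        return False, traceback.format_exc()


if __name__ == "__main__":
    for name in ("py4j.java_gateway", "pyspark"):
        ok, details = try_import(name)
        print(name + ": " + ("OK" if ok else "FAILED"))
        if not ok:
            print(details)
```

The traceback shows which file raised the ImportError, which helps tell a missing py4j apart from a problem inside pyspark itself.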

My Questions:
1. What is the source of this error? What is the cause?
2. How do I remedy the issue so I can run pyspark in my PyCharm editor?

NOTE: The current interpreter I use in pycharm is Python 2.7.8 (~/anaconda/bin/python)

Thanks ahead of time!

Don

Upvotes: 5

Answers (11)

Hari Krishnan

Reputation: 1166

The only thing that worked for me was to go to the base folder of Spark and open accumulators.py.

At the beginning of the file, a wrong multi-line comment was used. Remove all of it.

You're good to go!

Upvotes: 0

sono

Reputation: 326

I came across the same error. I just installed py4j.

sudo pip install py4j

There was no need to change .bashrc.

Upvotes: 3

Shuai.Z

Reputation: 386

First, set your environment variables:

export SPARK_HOME=/home/.../Spark/spark-2.0.1-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
PATH="$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$PYTHONPATH"

Make sure that you use your own version names.

Then restart! This is important for your settings to take effect.

Upvotes: 1

architectonic

Reputation: 3129

If you have just upgraded to a new Spark version, make sure the new version of py4j is in your PYTHONPATH, since each new Spark version comes with a new py4j version.

In my case it is "$SPARK_HOME/python/lib/py4j-0.10.3-src.zip" for Spark 2.0.1 instead of the old "$SPARK_HOME/python/lib/py4j-0.10.1-src.zip" for Spark 2.0.0.
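
Because the py4j version in the file name changes between Spark releases, the zip can be located with a glob instead of a hardcoded name. This is a minimal sketch, assuming the python/lib/py4j-*-src.zip layout described above; find_py4j_zip and add_spark_to_sys_path are hypothetical helper names:

```python
import glob
import os
import sys


def find_py4j_zip(spark_home):
    """Return the py4j source zip bundled under spark_home, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # A Spark download normally ships one py4j zip; take the last match if several exist
    return matches[-1] if matches else None


def add_spark_to_sys_path(spark_home):
    """Append Spark's python directory and its bundled py4j zip to sys.path."""
    entries = [os.path.join(spark_home, "python")]
    py4j_zip = find_py4j_zip(spark_home)
    if py4j_zip is not None:
        entries.append(py4j_zip)
    for entry in entries:
        if entry not in sys.path:
            sys.path.append(entry)
    return entries
```

Calling add_spark_to_sys_path(os.environ["SPARK_HOME"]) before importing pyspark should then pick up whichever py4j version ships with the installed Spark.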

Upvotes: 0

Karang

Reputation: 121

To get rid of ImportError: No module named py4j.java_gateway, you need to add the following lines:

import os
import sys


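# Raw strings keep the Windows backslashes from being treated as escape sequences
os.environ['SPARK_HOME'] = r"D:\python\spark-1.4.1-bin-hadoop2.4"

# Add pyspark and the bundled py4j sources to the module search path
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python")
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip")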

try:
    from pyspark import SparkContext
    from pyspark import SparkConf

    print ("success")

except ImportError as e:
    print ("error importing spark modules", e)
    sys.exit(1)

Upvotes: 1

shubham gorde

Reputation: 175

In PyCharm, before running the above script, make sure that you have unzipped the py4j*.zip file, and add a reference to it in the script: sys.path.append("path to spark*/python/lib")
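
For comparison, Python can generally import pure-Python modules straight from a zip archive listed on sys.path, which is why other answers here add the py4j zip itself rather than its parent folder. This is a minimal sketch of that mechanism using a throwaway archive; demo_zip_module and import_from_zip_demo are hypothetical names and have nothing to do with py4j:

```python
import os
import sys
import tempfile
import zipfile


def import_from_zip_demo():
    """Build a tiny zip holding one module, put the zip on sys.path, and import from it."""
    workdir = tempfile.mkdtemp()
    zip_path = os.path.join(workdir, "demo_src.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        # demo_zip_module is a made-up module used only for this demo
        archive.writestr("demo_zip_module.py", "ANSWER = 42\n")
    sys.path.append(zip_path)
    try:
        import demo_zip_module
        return demo_zip_module.ANSWER
    finally:
        # Leave sys.path as it was found
        sys.path.remove(zip_path)
```

If importing from the zip does not work in a given setup, unzipping as described above remains an option.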

It worked for me.

Upvotes: 1

Murali

Reputation: 76

I was able to find a fix for this on Windows, but I am not really sure of the root cause.

If you open accumulators.py, you will see that there is first a header comment, followed by help text and then the import statements. Move one or more of the import statements to just after the comment block and before the help text. This worked on my system, and I was able to import pyspark without any issues.

Upvotes: 0

Razi Shaban

Reputation: 512

I ran into this issue as well. To solve it, I commented out line 28 in ~/spark/spark/python/pyspark/context.py, the file which was causing the error:

# from pyspark import accumulators
from pyspark.accumulators import Accumulator

As the accumulator import seems to be covered by the following line (29), there doesn't seem to be an issue. Spark is now running fine (after pip install py4j).

Upvotes: 1

ben.ko

Reputation: 81

This is about the PYTHONPATH variable, which specifies Python's module search path.

Since the pyspark launcher mostly runs well, you can refer to the pyspark shell script and see that its PYTHONPATH setting looks like this:

PYTHONPATH=/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python

My environment is Cloudera QuickStart VM 5.3.
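
To check whether the interpreter used by the IDE sees the same entries, the PYTHONPATH value can be compared against the ones the launcher sets. This is a minimal sketch; missing_path_entries is a hypothetical helper name, and the expected paths in the usage lines are the ones from this environment and will differ elsewhere:

```python
import os


def missing_path_entries(pythonpath, expected):
    """Return the expected entries that do not appear in a PYTHONPATH-style string."""
    present = set(
        os.path.normpath(p) for p in pythonpath.split(os.pathsep) if p
    )
    return [e for e in expected if os.path.normpath(e) not in present]


if __name__ == "__main__":
    expected = [
        "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip",
        "/usr/lib/spark/python",
    ]
    print(missing_path_entries(os.environ.get("PYTHONPATH", ""), expected))
```

An empty list means both entries are present; anything printed is an entry the interpreter is not being given.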

Hope this helps.

Upvotes: 8

user1136149

Reputation: 251

I ran into the same issue using CDH 5.3.

In the end this turned out to be pretty easy to resolve. I noticed that the script /usr/lib/spark/bin/pyspark has variables defined for ipython.

I installed anaconda to /opt/anaconda:

export PATH=/opt/anaconda/bin:$PATH
#note that the default port 8888 is already in use so I used a different port
export IPYTHON_OPTS="notebook --notebook-dir=/home/cloudera/ipython-notebook --pylab inline --ip=* --port=9999"

Then, finally, I executed

/usr/bin/pyspark

which now functions as expected.

Upvotes: 1

matt2000

Reputation: 1073

This looks to me like a circular-dependency bug.

In [MY_HOME_DIR]/spark-1.2.0/python/pyspark/context.py, remove or comment out the line

from pyspark import accumulators

It's about 6 lines of code from the top.

I filed an issue with the Spark project here:

https://issues.apache.org/jira/browse/SPARK-4974
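
For readers unfamiliar with how a circular dependency can surface as "cannot import name", here is a minimal generic sketch. It is not a reproduction of the Spark code; demo_circ_a, demo_circ_b and circular_import_error are hypothetical names, and the two throwaway modules simply import names from each other:

```python
import importlib
import os
import sys
import tempfile


def circular_import_error():
    """Create two modules that import from each other and return the error text."""
    workdir = tempfile.mkdtemp()
    # demo_circ_a and demo_circ_b are made-up modules used only for this demo
    with open(os.path.join(workdir, "demo_circ_a.py"), "w") as f:
        f.write("from demo_circ_b import y\nx = 1\n")
    with open(os.path.join(workdir, "demo_circ_b.py"), "w") as f:
        f.write("from demo_circ_a import x\ny = 2\n")
    sys.path.insert(0, workdir)
    try:
        # demo_circ_a starts loading, pulls in demo_circ_b, which asks for x
        # before demo_circ_a has defined it
        importlib.import_module("demo_circ_a")
    except ImportError as e:
        return str(e)
    finally:
        sys.path.remove(workdir)
        # Drop any half-initialised modules so the demo can be repeated
        sys.modules.pop("demo_circ_a", None)
        sys.modules.pop("demo_circ_b", None)
    return None


if __name__ == "__main__":
    print(circular_import_error())
```

The returned message starts with "cannot import name", the same wording as the error in the question, although the exact text after that varies between Python versions.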

Upvotes: 4
