Reputation: 1201
I have configured PySpark in the PyCharm IDE (on Windows), and a simple program throws an exception when executed, even though the same program works fine in the pyspark shell. I think I'm missing some configuration in PyCharm. Could someone help me fix the issue? Details are below.
Code:
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)
lines = sc.textFile("C:\\documents\\ml-100k\\u.data")
ratings = lines.map(lambda x: x.split()[2])
result = ratings.countByValue()
sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))
Exception:
Traceback (most recent call last):
File "H:/Mine/OneDrive/Python/Python01/ratings-counter.py", line 5, in <module>
sc = SparkContext(conf=conf)
File "C:\spark\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\python\pyspark\context.py", line 259, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark\python\pyspark\java_gateway.py", line 80, in launch_gateway
proc = Popen(command, stdin=PIPE, env=env)
File "C:\Python27\Lib\subprocess.py", line 711, in __init__
errread, errwrite)
File "C:\Python27\Lib\subprocess.py", line 948, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
The same script executes fine in the pyspark shell; details are below:
C:\Windows\System32>pyspark
Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
17/05/11 13:56:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/11 13:56:22 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Python version 2.7.5 (default, May 15 2013 22:44:16)
SparkSession available as 'spark'.
>>> from pyspark import SparkConf, SparkContext
>>> import collections
>>>
>>> conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
>>> sc = SparkContext(conf=conf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\python\pyspark\context.py", line 275, in _ensure_initialized
callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by getOrCreate at C:\spark\bin\..\python\pyspark\shell.py:43
>>>
>>> lines = sc.textFile("C:\\documents\\ml-100k\\u.data")
>>> ratings = lines.map(lambda x: x.split()[2])
>>> result = ratings.countByValue()
[Stage 0:> (0 + 2) / 2]
[Stage 0:=============================> (1 + 1) / 2]
>>>
>>> sortedResults = collections.OrderedDict(sorted(result.items()))
>>> for key, value in sortedResults.items():
... print("%s %i" % (key, value))
...
1 6110
2 11370
3 27145
4 34174
5 21201
>>>
Upvotes: 2
Views: 3456
Reputation: 451
You need to configure PyCharm to use the Python libraries that ship with Spark rather than only the Python installation on your machine. Right now your script runs against the plain Python 2.7 interpreter, and because SPARK_HOME is not set, launch_gateway cannot locate spark-submit, which is why Popen fails with WindowsError: [Error 2].
Create Run configuration:
Go to Run -> Edit configurations
Add new Python configuration
Set Script path so it points to the script you want to execute
Edit Environment variables field so it contains at least:
SPARK_HOME - it should point to the directory with Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
PYTHONPATH - it should contain $SPARK_HOME/python and, if not available otherwise, $SPARK_HOME/python/lib/py4j-some-version.src.zip. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 for Spark 1.5, 0.9 for Spark 1.6.0); example values are sketched right after these steps
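For illustration only (the C:\spark location is taken from the traceback above; the exact Py4J zip name is an assumption, so check $SPARK_HOME/python/lib for the version your Spark 2.1.1 install actually ships), the Environment variables field might look roughly like this on Windows:
SPARK_HOME=C:\spark
PYTHONPATH=C:\spark\python;C:\spark\python\lib\py4j-0.10.4-src.zip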
Add PySpark library to the interpreter path (required for code completion):
Go to File -> Settings -> Project Interpreter
Open settings for an interpreter you want to use with Spark
Edit the interpreter paths so they include the path to $SPARK_HOME/python (and the Py4J zip if required)
Save the settings
Use newly created configuration to run your script.
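As a rough alternative sketch, not part of the steps above: the same wiring can also be done from the script itself before pyspark is imported. The C:\spark path is taken from the traceback and the Py4J zip name is an assumption, so adjust both to your installation.
import os
import sys

# Tell launch_gateway where to find spark-submit (assumed install location).
os.environ.setdefault("SPARK_HOME", "C:\\spark")
# Make the pyspark and py4j packages importable (zip name is an assumption).
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-0.10.4-src.zip"))

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)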
Spark 2.2.0 and later:
With SPARK-1267 merged, you should be able to simplify the process by pip-installing Spark in the environment you use for PyCharm development.
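For example (the PyPI package is named pyspark; after installing it into the interpreter PyCharm uses, the script should run without the SPARK_HOME/PYTHONPATH setup above):
pip install pyspark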
Upvotes: 4