Reputation: 13001
I am trying to create some scripts for PySpark using PyCharm. While I found multiple explanations of how to connect them (such as How to link PyCharm with PySpark?), not everything works properly.
What I did is basically set the environment variables correctly:
echo $PYTHONPATH
:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip
echo $SPARK_HOME
/usr/local/spark
and in the code I have:
appName = "demo1"
master = "local"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
The problem is that many DataFrame aggregation functions are flagged as errors. For example, I have the following lines:
from pyspark.sql import functions as agg_funcs
maxTimeStamp = base_df.agg(agg_funcs.max(base_df.time)).collect()
Yet PyCharm claims: Cannot find reference 'max' in functions.py. A similar error appears for most aggregate functions (e.g. col, count).
How would I fix this?
Upvotes: 2
Views: 2606
Reputation: 31
PyCharm -> Settings -> Project -> Project Structure -> Add Content Root
Select the following paths from the Spark installation folder:
spark/python/lib/py4j-...-src.zip
spark/python/lib/pyspark.zip
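Once those zips are added as content roots, PyCharm can index the PySpark sources and the references should resolve. A quick sanity check, mirroring the question's local setup (the tiny DataFrame below is invented purely for illustration), is to run a small aggregation and confirm the inspection warnings are gone:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as agg_funcs

# Minimal local context, matching the question's setup
conf = SparkConf().setAppName("demo1").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Small in-memory DataFrame; the 'time' column mirrors the question
base_df = sqlContext.createDataFrame([(1,), (2,), (3,)], ["time"])

# agg_funcs.max should now resolve in the editor and run correctly
maxTimeStamp = base_df.agg(agg_funcs.max(base_df.time)).collect()
print(maxTimeStamp)  # e.g. [Row(max(time)=3)]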
Upvotes: 2
Reputation: 13001
This is due to a limitation of PyCharm's Python code analysis, since PySpark generates some of its functions on the fly. I have actually opened an issue with JetBrains (https://youtrack.jetbrains.com/issue/PY-20200), which suggests some solutions, basically writing some interface code manually.
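To illustrate the interface-code workaround: a hand-written stub file can declare the names that pyspark.sql.functions generates at runtime, so PyCharm's inspector can find them. A minimal sketch, assuming a hypothetical stub named functions.pyi placed where PyCharm picks it up; the signatures below are simplified assumptions, not PySpark's real API surface:
# functions.pyi -- hypothetical stub so PyCharm can resolve names
# generated at runtime; signatures are simplified for illustration only
from pyspark.sql.column import Column

def max(col) -> Column: ...
def col(name) -> Column: ...
def count(col) -> Column: ...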
Update:
If you look at this thread you can see some progress on the topic: it has a working interface for most functions, and here is some more info on it.
Upvotes: 1
Reputation: 12607
Writing scripts in PyCharm is great, but for running them I advise using the spark-submit command right from the console.
If you really want to run them straight from PyCharm, there is a great GitHub project called findspark which lets you do just that.
Install the library and just add this to the top of your code:
import findspark
findspark.init()  # locates the Spark installation and adds pyspark to sys.path
The rest of the code goes just below that, and findspark will do all the work for you!
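For example, a complete minimal script run straight from PyCharm might look like this (assuming SPARK_HOME is set as in the question; the tiny job is just an invented smoke test):
import findspark
findspark.init()  # must run before any pyspark import

from pyspark import SparkConf, SparkContext

# A tiny job to confirm the setup works end to end
conf = SparkConf().setAppName("findspark-demo").setMaster("local")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # prints 45
sc.stop()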
Upvotes: 0