Reputation: 739
I've set up PySpark on an EC2 cluster with 2 nodes. I'm running pyspark using the command
pyspark --master spark://10.0.1.13:7077 --driver-memory 5G --executor-memory 12G --total-executor-cores 10
My Python script fails only when it runs UDFs. I can't work out why the failure is limited to UDFs and doesn't affect any other part of the script.
PATHS:
(base) [ec2-user@ip-10-0-1-13 ~]$ which pyspark
~/anaconda2/bin/pyspark
(base) [ec2-user@ip-10-0-1-13 ~]$ which python
~/anaconda2/bin/python
Python Script:
from pyspark.sql.functions import udf

def getDateObjectYear(dateString):
    # Currently just returns the trimmed date string (no year extraction yet).
    dateString = dateString.strip()
    return dateString

dateObjectUDFYear = udf(getDateObjectYear)
checkin_date_yelp_df = checkin_date_yelp_df.withColumn('year', dateObjectUDFYear(checkin_date_yelp_df.date))
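Side note on the definition above: in the PySpark API, udf() called without an explicit returnType defaults to StringType, so this UDF returns strings. A minimal sketch of the equivalent explicit declaration:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Same UDF with the return type spelled out (StringType is also the default).
dateObjectUDFYear = udf(getDateObjectYear, StringType())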
On running checkin_date_yelp_df.show(5), I get this error:
Py4JJavaError: An error occurred while calling o98.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 230, 10.0.1.13, executor 0): java.io.IOException: Cannot run program "~/anaconda2/bin/python": error=2, No such file or directory
..
..
..
..
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
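Why only the UDF fails: a DataFrame action on plain columns runs entirely inside the JVM, but a Python UDF forces each executor to fork a Python worker process, and that fork is what dies here, because Java's process launcher (the UNIXProcess.forkAndExec in the trace) never expands ~ in "~/anaconda2/bin/python". A minimal sketch to confirm this without any UDF, assuming an active SparkSession named spark: any Python-level RDD job also needs a Python worker on the executors.
import sys

# Any Python lambda on an RDD forks the same Python worker a UDF does,
# so this either fails with the same IOException or prints the interpreter
# path each executor actually uses.
print(spark.sparkContext.parallelize([1, 2], 2).map(lambda _: sys.executable).collect())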
Upvotes: 0
Views: 525
Reputation: 739
Turns out I had 2 paths incorrectly configured in .bashrc.
Correct way:
export PYTHONPATH=/home/ec2-user/anaconda/bin/python
export PYSPARK_PYTHON=/home/ec2-user/anaconda/bin/python
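A hedged alternative, not from the original answer: the worker interpreter can also be set per-session instead of in .bashrc. spark.pyspark.python is the Spark 2.1+ configuration key that takes precedence over PYSPARK_PYTHON; the value must be an absolute path that exists on every node, since executors never expand ~. The anaconda2 path below is taken from the question's which python output; substitute whichever directory actually exists on your nodes.
pyspark --master spark://10.0.1.13:7077 \
    --conf spark.pyspark.python=/home/ec2-user/anaconda2/bin/python \
    --driver-memory 5G --executor-memory 12G --total-executor-cores 10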
Upvotes: 1