Reputation: 739
I've set up PySpark on an EC2 cluster with 2 nodes. I'm running pyspark using the command
pyspark --master spark://10.0.1.13:7077 --driver-memory 5G --executor-memory 12G --total-executor-cores 10
My Python script fails only when it runs UDFs. I can't work out why the failure is limited to UDFs and doesn't affect any other part of the script.
PATHS:
(base) [ec2-user@ip-10-0-1-13 ~]$ which pyspark
~/anaconda2/bin/pyspark
(base) [ec2-user@ip-10-0-1-13 ~]$ which python
~/anaconda2/bin/python
Python Script:
from pyspark.sql.functions import udf

def getDateObjectYear(dateString):
    # Currently just returns the trimmed date string (no year extraction yet).
    dateString = dateString.strip()
    return dateString

dateObjectUDFYear = udf(getDateObjectYear)
checkin_date_yelp_df = checkin_date_yelp_df.withColumn('year', dateObjectUDFYear(checkin_date_yelp_df.date))
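Side note on the definition above: in the PySpark API, udf() called without an explicit returnType defaults to StringType, so this UDF returns strings. A minimal sketch of the equivalent explicit declaration:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Same UDF with the return type spelled out (StringType is also the default).
dateObjectUDFYear = udf(getDateObjectYear, StringType())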
On running checkin_date_yelp_df.show(5), I get this error:
Py4JJavaError: An error occurred while calling o98.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 230, 10.0.1.13, executor 0): java.io.IOException: Cannot run program "~/anaconda2/bin/python": error=2, No such file or directory
..
..
..
..
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
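Why only the UDF fails: a DataFrame action on plain columns runs entirely inside the JVM, but a Python UDF forces each executor to fork a Python worker process, and that fork is what dies here, because Java's process launcher (the UNIXProcess.forkAndExec in the trace) never expands ~ in "~/anaconda2/bin/python". A minimal sketch to confirm this without any UDF, assuming an active SparkSession named spark: any Python-level RDD job also needs a Python worker on the executors.
import sys

# Any Python lambda on an RDD forks the same Python worker a UDF does,
# so this either fails with the same IOException or prints the interpreter
# path each executor actually uses.
print(spark.sparkContext.parallelize([1, 2], 2).map(lambda _: sys.executable).collect())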
Upvotes: 0
Views: 525
Reputation: 739
Turns out I had 2 paths incorrectly configured in .bashrc.
Correct way:
export PYTHONPATH=/home/ec2-user/anaconda/bin/python
export PYSPARK_PYTHON=/home/ec2-user/anaconda/bin/python
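A hedged alternative, not from the original answer: the worker interpreter can also be set per-session instead of in .bashrc. spark.pyspark.python is the Spark 2.1+ configuration key that takes precedence over PYSPARK_PYTHON; the value must be an absolute path that exists on every node, since executors never expand ~. The anaconda2 path below is taken from the question's which python output; substitute whichever directory actually exists on your nodes.
pyspark --master spark://10.0.1.13:7077 \
    --conf spark.pyspark.python=/home/ec2-user/anaconda2/bin/python \
    --driver-memory 5G --executor-memory 12G --total-executor-cores 10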
Upvotes: 1