Qdesmedt

Reputation: 63

Weird behavior in PySpark

I observed some weird behavior in PySpark. Maybe one of you knows what is happening. If I do this:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def create_my_date(mydate):
    try:
        return mydate.strftime('%Y%m')
    except:
        return None

df = df.withColumn(
    "date_string",
    F.udf(create_my_date, StringType())(df.mydate)
)

df.filter(~df.mydate.isNotNull()).count()
df.filter(df.mydate.isNotNull()).count()

This outputs:

0
10

So there are no null values in the column df.mydate.

But if I change the create_my_date function and remove the try/except:

def create_my_date(mydate):
    return mydate.strftime('%Y%m')


df = df.withColumn(
    "date_string", 
    F.udf(create_my_date, StringType())(df.mydate)
)

df.filter(~df.mydate.isNotNull()).count()
df.filter(df.mydate.isNotNull()).count()

The job fails with:

Py4JJavaError: An error occurred while calling o7058.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 997.0 failed 4 times, most recent failure: Lost task 22.3 in stage 997.0 (TID 335940, 126.102.230.110, executor 29): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 92, in <lambda>
    mapper = lambda a: udf(*a)
  File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<ipython-input-109-422e4b5e07cf>", line 2, in create_my_date
AttributeError: 'NoneType' object has no attribute 'strftime'

Does anyone have an explanation for me?

Thanks!

Upvotes: 0

Views: 501

Answers (1)

kshell

Reputation: 236

The reason you're getting the AttributeError is that you're calling strftime on a None value. The traceback shows the error is raised inside create_my_date; since it runs as a UDF, it receives each row's value as a plain Python object, and for null rows that object is None. So essentially it's doing this:

>>> None.strftime("%Y%m")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'
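
In your first version, the bare except silently converts that same failure into None, which is why nothing appeared to go wrong. A quick illustration using your own function:

def create_my_date(mydate):
    try:
        return mydate.strftime('%Y%m')
    except:  # swallows the AttributeError raised for None
        return None

create_my_date(None)  # returns None instead of raising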

Instead, you can accomplish what you want with built-in DataFrame functions, which are faster than a UDF and don't need a try/except block:

from pyspark.sql.functions import date_format
from datetime import datetime

df = spark.createDataFrame([[datetime(2018, 3, 2).date()], [None]], ["mydate"])

# 'yyyyMM' matches the original '%Y%m'; 'Y' would be the week-based year
df = df.withColumn("date_string", date_format("mydate", "yyyyMM"))
df.show()

The resulting dataframe:

+----------+-----------+
|    mydate|date_string|
+----------+-----------+
|2018-03-02|     201803|
|      null|       null|
+----------+-----------+

Then your counts:

df.filter(df["mydate"].isNotNull()).count()
df.filter(df["mydate"].isNull()).count()

Returns as expected:

1
1
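
If you ever do need a UDF for logic the built-in functions can't express, a safer version of the original function (a sketch, not from the question) guards against None explicitly instead of relying on a bare except:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def create_my_date(mydate):
    # Handle null input explicitly; a bare except would also hide
    # real bugs, such as typos inside the try block.
    if mydate is None:
        return None
    return mydate.strftime('%Y%m')

df = df.withColumn("date_string", F.udf(create_my_date, StringType())(df.mydate))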

Upvotes: 2
