Sohel Reza

Reputation: 321

Handling null values in a PySpark DataFrame

I have a PySpark DataFrame and I want to take a substring of one of its columns, but that column also contains some null values. Here is my DataFrame:

+--------------+
|          Name|
+--------------+
| Asia201909284|
|    US20190928|
|Europ201909287|
|          null|
|     something|
|       nothing|
+--------------+

All I want is to remove the Asia, US, and Europ prefixes from the Name column.

Here is the code I have already tried:

from pyspark.sql.functions import udf, when, col

fun_asia = udf(lambda x: x[4:len(x)])
fun_us = udf(lambda x: x[2:len(x)])
fun_europ = udf(lambda x: x[5:len(x)])
df1.withColumn("replace", \
               when(df1.Name.isNull(),df1.Name)\
               .when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
               .when(df1.Name.like("US%"),fun_us(col('Name')))\
               .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
               .otherwise(df1.Name)
              ).show()

It works properly when there are no null values in the column, but when there are nulls it raises an error because len() cannot be applied to a null value.

Error message:

TypeError: object of type 'NoneType' has no len()
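
For reference, the error comes from plain Python rather than from Spark itself; calling len() on None inside the slice raises exactly this TypeError:

x = None
x[4:len(x)]  # TypeError: object of type 'NoneType' has no len()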

What confuses me is why the function is being called for the null values at all. How can I overcome this problem and get the result I want? Any help is appreciated.

The actual result that I want:

+--------------+---------+
|          Name|  replace|
+--------------+---------+
| Asia201909284|201909284|
|    US20190928| 20190928|
|Europ201909287|201909287|
|          null|     null|
|     something|something|
|       nothing|  nothing|
+--------------+---------+

Upvotes: 2

Views: 4785

Answers (1)

pissall

Reputation: 7419

One approach is to add a when with the isNull() condition to handle the rows where the column is null:

df1.withColumn("replace", \
               when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
               .when(df1.Name.like("US%"),fun_us(col('Name')))\
               .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
               .when(df1.Name.isNull(), df1.Name)
               .otherwise(df1.Name)
              ).show()

EDIT 2:

Spark can also evaluate Python UDFs for every row, regardless of which when branch is taken, which is why the null rows still reach the lambdas. The more robust fix is to change your UDFs to handle the nulls themselves:

fun_asia = udf(lambda x: x[4:len(x)] if x else None)
fun_us = udf(lambda x: x[2:len(x)] if x else None)
fun_europ = udf(lambda x: x[5:len(x)] if x else None)
df1.withColumn("replace", \
               when(df1.Name.isNull(),df1.Name)\
               .when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
               .when(df1.Name.like("US%"),fun_us(col('Name')))\
               .when(df1.Name.like("Europ%"),fun_europ(col('Name')))
               .otherwise(df1.Name)
              ).show()
+--------------+---------+
|          Name|  replace|
+--------------+---------+
| Asia201909284|201909284|
|    US20190928| 20190928|
|Europ201909287|201909287|
|          null|     null|
|     something|something|
|       nothing|  nothing|
+--------------+---------+
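
As a side note, if the goal is only to strip those fixed prefixes, the built-in regexp_replace avoids Python UDFs entirely and is null-safe by default, since it simply passes null rows through. A minimal sketch, assuming the prefixes are always exactly Asia, US, or Europ at the start of the value:

from pyspark.sql import functions as F

# Strip a leading "Asia", "US" or "Europ"; null rows stay null and
# non-matching rows such as "something" come back unchanged.
df1.withColumn("replace", F.regexp_replace("Name", r"^(Asia|US|Europ)", "")).show()

This produces the same output as above without the overhead of serializing rows to Python.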

Upvotes: 2
