Reputation: 321
I have pyspark dataframe
with some data and i want to substring
some data of a column, that column also contain some null
value.
here is my dataframe
+-------------+
| Name|
+--------------+
| Asia201909284|
| US20190928|
|Europ201909287|
| null|
| something|
| nothing|
+--------------+
all i want to eliminate Asia, US, Europ
from column Name
Here is my code that I already tried.
fun_asia = udf(lambda x: x[4:len(x)])
fun_us = udf(lambda x: x[2:len(x)])
fun_europ = udf(lambda x: x[5:len(x)])
df1.withColumn("replace", \
when(df1.Name.isNull(),df1.Name)\
.when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
.when(df1.Name.like("US%"),fun_us(col('Name')))\
.when(df1.Name.like("Europ%"),fun_europ(col('Name')))
.otherwise(df1.Name)
).show()
It worked properly if there is no null
value in that column. but if there is some null
value it gave a error like len()
cant calculate null value.
Error massage
TypeError: object of type 'NoneType' has no len()
Things that i confused why its calling fun also for null
value.
and how can i overcome my problem and get the result i want, any help appreciate.
Actual result that i want
+--------------+---------+
| Name| replace|
+--------------+---------+
| Asia201909284|201909284|
| US20190928| 20190928|
|Europ201909287|201909287|
| null| null|
| something|something|
| nothing| nothing|
+--------------+---------+
Upvotes: 2
Views: 4785
Reputation: 7419
One approach is using a when
with the isNull()
condition to handle the when column is null
condition:
df1.withColumn("replace", \
when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
.when(df1.Name.like("US%"),fun_us(col('Name')))\
.when(df1.Name.like("Europ%"),fun_europ(col('Name')))
.when(df1.Name.isNull(), df1.Name)
.otherwise(df1.Name)
).show()
You can change your udf to handle the nulls:
fun_asia = udf(lambda x: x[4:len(x)] if x else None)
fun_us = udf(lambda x: x[2:len(x)] if x else None)
fun_europ = udf(lambda x: x[5:len(x)] if x else None)
df1.withColumn("replace", \
when(df1.Name.isNull(),df1.Name)\
.when(df1.Name.like("Asia%"),fun_asia(col('Name')))\
.when(df1.Name.like("US%"),fun_us(col('Name')))\
.when(df1.Name.like("Europ%"),fun_europ(col('Name')))
.otherwise(df1.Name)
).show()
+--------------+---------+
| Name| replace|
+--------------+---------+
| Asia201909284|201909284|
| US20190928| 20190928|
|Europ201909287|201909287|
| null| null|
| something|something|
| nothing| nothing|
+--------------+---------+
Upvotes: 2