Reputation: 375

Pyspark substring is not working inside of UDF

I'm trying in vain to use the Pyspark substring function inside a UDF. Below is my code snippet:

from pyspark.sql.functions import substring

def my_udf(my_str):
    try:
        my_sub_str = substring(my_str,1, 2)
    except Exception:
        pass
    else:
        return (my_sub_str)

apply_my_udf = udf(my_udf)

df = input_data.withColumn("sub_str", apply_my_udf(input_data.col0))

The sample data is:

ABC1234
DEF2345
GHI3456

But when I print the df, I don't get any value in the new column "sub_str", as shown below:

[Row(col0='ABC1234', sub_str=None), Row(col0='DEF2345', sub_str=None), Row(col0='GHI3456', sub_str=None)]

Can anyone please let me know what I'm doing wrong?

Upvotes: 0

Answers (2)

michalrudko

Reputation: 1530

If you need to use udf for that, you could also try something like:

input_data = spark.createDataFrame([
    (1,"ABC1234"), 
    (2,"DEF2345"),
    (3,"GHI3456")
], ("id","col0"))

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

udf1 = udf(lambda x: x[0:2], StringType())
input_data.withColumn('sub_str', udf1('col0')).show()

+---+-------+-------+
| id|   col0|sub_str|
+---+-------+-------+
|  1|ABC1234|     AB|
|  2|DEF2345|     DE|
|  3|GHI3456|     GH|
+---+-------+-------+
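One caveat with the lambda above: if col0 contains nulls, the UDF receives None and x[0:2] raises a TypeError. Below is a minimal sketch of a null-safe variant, assuming the same input_data DataFrame; the names first_two and add_sub_str are hypothetical helpers introduced only for illustration.

```python
def first_two(value):
    """Return the first two characters, or None when the input is null."""
    if value is None:
        return None
    return value[0:2]


def add_sub_str(df, column="col0"):
    """Hypothetical helper: append a 'sub_str' column computed by first_two."""
    # Imported here so first_two can be exercised without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    first_two_udf = udf(first_two, StringType())
    return df.withColumn("sub_str", first_two_udf(column))
```

Calling add_sub_str(input_data).show() should give the same table as above, and rows where col0 is null would get a null sub_str instead of failing the job.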

However, as Mohamed Ali JAMAOUI wrote, you can easily do this without a udf here.

Upvotes: 1

Mohamed Ali JAMAOUI

Reputation: 14699

You don't need a udf to use substring; here's a cleaner and faster way:

>>> from pyspark.sql import functions as f
>>> df.show()
+-------+
|   data|
+-------+
|ABC1234|
|DEF2345|
|GHI3456|
+-------+

>>> df.withColumn("sub_str", f.substring("data", 1, 2)).show()
+-------+-------+
|   data|sub_str|
+-------+-------+
|ABC1234|     AB|
|DEF2345|     DE|
|GHI3456|     GH|
+-------+-------+
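As for why the original UDF returned None for every row: most likely substring raises inside the UDF, because it builds a Column expression and is not meant to be called on the plain Python str that a UDF receives. The `except Exception: pass` then hides that error, and the function falls through and returns None. Here is a pure-Python sketch of that pattern; column_only_substring is a hypothetical stand-in for the real function, not Spark's actual behavior or error message.

```python
def column_only_substring(value, pos, length):
    # Hypothetical stand-in for pyspark.sql.functions.substring being
    # called on a plain str inside a UDF: assume it raises.
    raise TypeError("expected a Column expression, got " + type(value).__name__)


def my_udf(my_str):
    try:
        my_sub_str = column_only_substring(my_str, 1, 2)
    except Exception:
        pass  # the error is silently discarded here
    else:
        return my_sub_str
    # Falling off the end of the function returns None implicitly.


print(my_udf("ABC1234"))  # prints None
```

Removing the bare try/except while debugging (or logging the exception) would surface the real error instead of a column of nulls.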

Upvotes: 1
