Deepak_Spark_Beginner

Reputation: 273

Unable to use pyspark udf

I am trying to format the string in one of the columns using a pyspark UDF.

Below is my dataset:

+--------------------+--------------------+
|             artists|                  id|
+--------------------+--------------------+
|     ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
|     ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
|            ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
|     ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
|          ['Meetya']|06NUxS2XL3efRh0bl...|
|        ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
|        ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
|        ['Justrock']|0DH1IROKoPK5XTglU...|
|  ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
|     ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+

And code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log

session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()

df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")


# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
#         withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()

def format_column(st):
    print(type(st))
    print(1)
    return st.upper()


session.udf.register("format_str", format_column)

df.select("id",format_column(df.artists)).show(truncate=False)

# schema = t.StructType(
#     [
#         t.StructField("artists", t.ArrayType(t.StringType()), True),
#         t.StructField("id", t.StringType(), True)
#
#     ]
# )

df.show(truncate=False)

The UDF is still not complete, but with this error I am not able to move further. When I run the above code I get the error below:

<class 'pyspark.sql.column.Column'>
1
Traceback (most recent call last):
  File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
    df.select("id",format_column(df.artists)).show(truncate=False)
  File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
    return st.upper()
TypeError: 'Column' object is not callable

The syntax looks fine and I am not able to figure out what is wrong with the code.

Upvotes: 2

Views: 3644

Answers (2)

Nassereddine BELGHITH

Reputation: 640

Well, I see that you are using a predefined Spark function in the definition of a UDF, which is acceptable since you said you are starting with some examples. Your error means that there is no method called upper on a Column; you can fix that error with this definition:

    @f.udf("string")
    def format_column(st):
        print(type(st))
        print(1)
        return st.upper()


Upvotes: 1

blackbishop

Reputation: 32720

You get this error because you are calling the python function format_column instead of the registered UDF format_str.

You should be using:

from pyspark.sql import functions as F

df.select("id", F.expr("format_str(artists)")).show(truncate=False)

Moreover, the way you registered the UDF, you can't use it with the DataFrame API but only in Spark SQL. If you want to use it within the DataFrame API, you should define the function like this:

from pyspark.sql.types import StringType

format_str = F.udf(format_column, StringType())

df.select("id", format_str(df.artists)).show(truncate=False)

Or using annotation syntax:

@F.udf("string")
def format_column(st):
    print(type(st))
    print(1)
    return st.upper()

df.select("id", format_column(df.artists)).show(truncate=False) 

That said, you should use Spark built-in functions (upper in this case) unless you have a specific need that can't be done using Spark functions.

Upvotes: 2
