Unable to use pyspark udf

Question

I am trying to format the string in one of the columns using a PySpark UDF.

Below is my dataset:

+--------------------+--------------------+
|             artists|                  id|
+--------------------+--------------------+
|     ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
|     ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
|            ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
|     ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
|          ['Meetya']|06NUxS2XL3efRh0bl...|
|        ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
|        ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
|        ['Justrock']|0DH1IROKoPK5XTglU...|
|  ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
|     ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+

And code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log

session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()

df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")


# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
#         withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()

def format_column(st):
    print(type(st))
    print(1)
    return st.upper()


session.udf.register("format_str", format_column)

df.select("id",format_column(df.artists)).show(truncate=False)

# schema = t.StructType(
#     [
#         t.StructField("artists", t.ArrayType(t.StringType()), True),
#         t.StructField("id", t.StringType(), True)
#
#     ]
# )

df.show(truncate=False)

The UDF is not complete yet, but this error is preventing me from going any further. When I run the above code, I get the following error:


1
Traceback (most recent call last):
  File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
    df.select("id",format_column(df.artists)).show(truncate=False)
  File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
    return st.upper()
TypeError: 'Column' object is not callable

The syntax looks fine, and I am not able to figure out what is wrong with the code.

Nassereddine BELGHITH · Accepted Answer

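Well, I see that you are wrapping something Spark already provides as a predefined function (upper) in a UDF, which is acceptable as you said that you are starting with some examples. Your error occurs because format_column(df.artists) runs as a plain Python function on a Column object, and a Column has no method called upper: st.upper is treated as a reference to a nested field and returns another Column, so calling it raises "TypeError: 'Column' object is not callable". You can correct the error by declaring the function as a UDF with this definition: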

    @f.udf("string")
    def format_column(st):
        print(type(st))
        print(1)
        return st.upper()

For example:
