Reputation: 273
I am trying to format the string in one the columns using pyspark udf.
Below is my dataset:
+--------------------+--------------------+
| artists| id|
+--------------------+--------------------+
| ['Mamie Smith']|0cS0A1fUEUd1EW3Fc...|
|"[""Screamin' Jay...|0hbkKFIJm7Z05H8Zl...|
| ['Mamie Smith']|11m7laMUgmOKqI3oY...|
| ['Oscar Velazquez']|19Lc5SfJJ5O1oaxY0...|
| ['Mixe']|2hJjbsLCytGsnAHfd...|
|['Mamie Smith & H...|3HnrHGLE9u2MjHtdo...|
| ['Mamie Smith']|5DlCyqLyX2AOVDTjj...|
|['Mamie Smith & H...|02FzJbHtqElixxCmr...|
|['Francisco Canaro']|02i59gYdjlhBmbbWh...|
| ['Meetya']|06NUxS2XL3efRh0bl...|
| ['Dorville']|07jrRR1CUUoPb1FLf...|
|['Francisco Canaro']|0ANuF7SvPeIHanGcC...|
| ['Ka Koula']|0BEO6nHi1rmTOPiEZ...|
| ['Justrock']|0DH1IROKoPK5XTglU...|
| ['Takis Nikolaou']|0HVjPaxbyfFcg8Rh0...|
|['Aggeliki Karagi...|0Hn7LWy1YcKhPaA2N...|
|['Giorgos Katsaros']|0I6DjrEfd3fKFESHE...|
|['Francisco Canaro']|0KGiP9EW1xtojDHsT...|
|['Giorgos Katsaros']|0KNI2d7l3ByVHU0g2...|
| ['Amalia Vaka']|0LYNwxHYHPW256lO2...|
+--------------------+--------------------+
And code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
import logging as log
session = SparkSession.builder.master("local").appName("First Python App").getOrCreate()
df = session.read.option("header", "true").csv("/home/deepak/Downloads/spotify_data_Set/data.csv")
df = df.select("artists", "id")
# df = df.withColumn("new_atr",f.translate(f.col("artists"),'"', "")).\
# withColumn("new_atr_2" , f.translate(f.col("artists"),'[', ""))
df.show()
def format_column(st):
print(type(st))
print(1)
return st.upper()
session.udf.register("format_str", format_column)
df.select("id",format_column(df.artists)).show(truncate=False)
# schema = t.StructType(
# [
# t.StructField("artists", t.ArrayType(t.StringType()), True),
# t.StructField("id", t.StringType(), True)
#
# ]
# )
df.show(truncate=False)
The UDF is still not complete but with the error, I am not able to move further. When I run the above code I am getting below error:
<class 'pyspark.sql.column.Column'>
1
Traceback (most recent call last):
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 25, in <module>
df.select("id",format_column(df.artists)).show(truncate=False)
File "/home/deepak/PycharmProjects/Spark/src/test.py", line 18, in format_column
return st.upper()
TypeError: 'Column' object is not callable
The syntax looks fine and I am not able to figure out what wrong with the code.
Upvotes: 2
Views: 3644
Reputation: 640
well , I see that you are using a predined spark function in the definition of an UDF which is acceptable as you said that you are starting with some examples , your error means that there is no method called upper for a column however you can correct that error using this defintion:
@f.udf("string")
def format_column(st):
print(type(st))
print(1)
return st.upper()
for example :
Upvotes: 1
Reputation: 32720
You get this error because you are calling the python function format_column
instead of the registered UDF format_str
.
You should be using :
from pyspark.sql import functions as F
df.select("id", F.expr("format_str(artists)")).show(truncate=False)
Moreover, the way you registered the UDF you can't use it with DataFrame API but only in Spark SQL. If you want to use it within DataFrame API you should define the function like this :
format_str = F.udf(format_column, StringType())
df.select("id", format_str(df.artists)).show(truncate=False)
Or using annotation syntax:
@F.udf("string")
def format_column(st):
print(type(st))
print(1)
return st.upper()
df.select("id", format_column(df.artists)).show(truncate=False)
That said, you should use Spark built-in functions (upper
in this case) unless you have a specific need that can't be done using Spark functions.
Upvotes: 2