Reputation: 2312
I have a dataframe with only one column. I would like to split the string using pandas_udf in pyspark. Hence, I have the following code:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('str')
def split_msg(string):
    msg_ = string.split(" ")
    return msg_

temp = temp.select("_c6").withColumn("decoded",
    split_msg(temp._c6)).drop("_c6")
But this is not working.
Any help is much appreciated!
Upvotes: 0
Views: 1379
Reputation: 13998
Change your function to the following:
@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_msg(string):
    msg_ = string.str.split(" ")
    return msg_
Basically, your function's returnType should be an array of StringType(), and the argument string is a pandas Series rather than a single str, so you need string.str.split(" ") instead of string.split(" ").
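To see why the `.str` accessor is needed, here is a minimal pandas-only sketch of what happens inside a scalar pandas_udf. The sample strings and the variable name `batch` are made up for illustration; in Spark, the Series would be a batch of values from the `_c6` column.

```python
import pandas as pd

# Hypothetical batch of column values, as a scalar pandas_udf would receive them.
batch = pd.Series(["a b c", "hello world"])

# A Series has no .split method of its own, so string.split(" ") fails.
has_split = hasattr(batch, "split")

# The .str accessor applies split element-wise and yields one list per row,
# which is why the UDF's return type must be array<string>.
result = batch.str.split(" ")
print(has_split)
print(result.tolist())
```

Each element of the result is a Python list, which Spark maps to an array<string> value.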
However, if you just want to split the text, Spark's DataFrame API provides a built-in function, pyspark.sql.functions.split, which should be more efficient than a pandas_udf.
Upvotes: 2