I. A

Reputation: 2312

how to split strings in dataframe using pandas_udf in pyspark

I have a DataFrame with a single column. I would like to split the strings using a pandas_udf in PySpark, so I have the following code:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('str')
def split_msg(string):
  msg_ = string.split(" ")
  return msg_

temp = temp.select("_c6").withColumn("decoded", split_msg(temp._c6)).drop("_c6")

But this is not working.

Any help is much appreciated!

Upvotes: 0

Views: 1379

Answers (1)

jxc

Reputation: 13998

Change your function to the following:

@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_msg(string):
    # `string` arrives as a pandas Series, so use the vectorized .str accessor
    msg_ = string.str.split(" ")
    return msg_

Basically, your function's returnType should be an array of StringType(), and the argument string is a pandas Series, so you need string.str.split(" ").
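As a minimal usage sketch (assuming an existing SparkSession named spark and the column name "_c6" from the question; the sample rows are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_msg(string):
    # `string` is a pandas Series of column values
    return string.str.split(" ")

df = spark.createDataFrame([("hello world foo",), ("bar baz",)], ["_c6"])
result = df.withColumn("decoded", split_msg(df._c6)).drop("_c6")
result.printSchema()   # decoded: array<string>
result.show(truncate=False)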

However, if you just want to split the text, Spark's DataFrame API provides a built-in function, pyspark.sql.functions.split, which should be more efficient than using a pandas_udf.
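For comparison, a short sketch of the built-in approach, reusing the temp/_c6 names from the question:

from pyspark.sql.functions import split

# split() takes a column and a pattern; no Python UDF overhead
temp = temp.select("_c6").withColumn("decoded", split(temp._c6, " ")).drop("_c6")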

Upvotes: 2
