I. A

Reputation: 2312

how to split strings in dataframe using pandas_udf in pyspark

I have a DataFrame with a single column. I would like to split the strings using a pandas_udf in PySpark, so I have the following code:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('str')
def split_msg(string):
  msg_ = string.split(" ")
  return msg_

temp = temp.select("_c6").withColumn("decoded", split_msg(temp._c6)).drop("_c6")

But this is not working.

Any help is much appreciated!

Upvotes: 0

Views: 1379

Answers (1)

jxc

Reputation: 13998

Change your function to the following:

@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_msg(string):
    # `string` arrives as a pandas Series, so use the vectorized .str accessor
    msg_ = string.str.split(" ")
    return msg_

Basically, your function's returnType should be an array of StringType(), and the argument string is a pandas Series, so you need string.str.split(" ").
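As a minimal usage sketch (assuming an existing SparkSession named spark and the column name "_c6" from the question; the sample rows are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_msg(string):
    # `string` is a pandas Series of column values
    return string.str.split(" ")

df = spark.createDataFrame([("hello world foo",), ("bar baz",)], ["_c6"])
result = df.withColumn("decoded", split_msg(df._c6)).drop("_c6")
result.printSchema()   # decoded: array<string>
result.show(truncate=False)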

However, if you just want to split the text, Spark's DataFrame API provides a built-in function, pyspark.sql.functions.split, which should be more efficient than using a pandas_udf.
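For comparison, a short sketch of the built-in approach, reusing the temp/_c6 names from the question:

from pyspark.sql.functions import split

# split() takes a column and a pattern; no Python UDF overhead
temp = temp.select("_c6").withColumn("decoded", split(temp._c6, " ")).drop("_c6")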

Upvotes: 2
