Siab Shafique

Reputation: 48

Mapping a function to multiple columns of a pyspark dataframe

I have a pyspark df which has many columns but a subset looks like this:

datetime eventid sessionid lat lon filtertype
someval someval someval someval someval someval
someval someval someval someval someval someval

I want to map a function some_func(), which only makes use of the columns 'lat', 'lon' and 'event_id', to return a Boolean value that would be added to the df as a separate column named 'verified'. Basically I need to retrieve the columns of interest separately inside the function and do my operations on them. I know I can use UDFs or df.withColumn(), but those are used to map over a single column; for that I would need to concatenate the columns of interest into one column, which would make the code a bit messy.

Is there a way to retrieve the column values separately inside the function and map that function over the entire dataframe, similar to what we can do with a pandas df using map/lambda and df.apply()?
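For reference, this is roughly the pandas pattern I have in mind (some_func and the values here are just placeholders for illustration):

import pandas as pd

def some_func(lat, lon, event_id):
    # placeholder logic, only to illustrate the pattern
    return lat is not None and lon is not None

pdf = pd.DataFrame({"lat": [31.5], "lon": [74.3], "eventid": ["e1"]})
# row-wise apply: each column value is available separately inside the function
pdf["verified"] = pdf.apply(lambda row: some_func(row["lat"], row["lon"], row["eventid"]), axis=1)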

Upvotes: 1

Views: 2760

Answers (1)

hprakash

Reputation: 472

You can create a UDF which takes multiple columns as parameters.

For example:

from pyspark.sql import functions as f
from pyspark.sql.types import BooleanType

def your_function(p1, p2, p3):
    # your logic goes here and must return a bool
    # placeholder check, replace with your own verification logic
    return p1 is not None and p2 is not None

udf_func = f.udf(your_function, BooleanType())


df = spark.read.....

df2 = df.withColumn("verified", udf_func(f.col("lat"), f.col("lon"), f.col("event_id")))

df2.show(truncate=False)
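
If the dataframe is large, a vectorized pandas_udf will usually perform better than a plain Python UDF. A rough sketch, starting from the same df and assuming Spark 3.0+ with pyarrow installed (the check inside is only a placeholder):

import pandas as pd
from pyspark.sql import functions as f
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType

@pandas_udf(BooleanType())
def verified_pandas_udf(lat: pd.Series, lon: pd.Series, event_id: pd.Series) -> pd.Series:
    # operates on whole column batches at once instead of row by row
    return lat.notna() & lon.notna()  # placeholder verification logic

df2 = df.withColumn("verified", verified_pandas_udf(f.col("lat"), f.col("lon"), f.col("event_id")))
df2.show(truncate=False)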

Upvotes: 2
