Reputation: 48
I have a pyspark df which has many columns but a subset looks like this:
| datetime | eventid | sessionid | lat | lon | filtertype |
| --- | --- | --- | --- | --- | --- |
| someval | someval | someval | someval | someval | someval |
| someval | someval | someval | someval | someval | someval |
I want to map a function some_func(), which only makes use of the columns 'lat', 'lon' and 'eventid', to return a Boolean value that would be added to the df as a separate column named 'verified'. Basically I need to retrieve the columns of interest inside the function separately and do my operations on them. I know I can use UDFs with df.withColumn(), but I thought those map over a single column; to use them I would have to concatenate the columns of interest into one column, which would make the code a bit messy.
Is there a way to retrieve the column values inside the function separately and map that function over the entire dataframe (similar to what we can do with a pandas df using map/lambda and df.apply())?
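For clarity, this is roughly the pandas pattern I have in mind (a sketch with toy data and placeholder logic, just to show the shape of the problem):

import pandas as pd

pdf = pd.DataFrame({
    "eventid": ["e1", "e2"],
    "lat": [52.5, 95.0],
    "lon": [13.4, 200.0],
})

def some_func(row):
    # placeholder verification logic: check lat/lon are in valid ranges
    return -90.0 <= row["lat"] <= 90.0 and -180.0 <= row["lon"] <= 180.0

pdf["verified"] = pdf.apply(some_func, axis=1)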
Upvotes: 1
Views: 2760
Reputation: 472
You can create a UDF that takes multiple columns as parameters.
For example:
from pyspark.sql import functions as f
from pyspark.sql.types import BooleanType

def your_function(p1, p2, p3):
    # your logic goes here
    # return a bool, e.g.:
    return True

udf_func = f.udf(your_function, BooleanType())

df = spark.read.....
df2 = df.withColumn("verified", udf_func(f.col("lat"), f.col("lon"), f.col("eventid")))
df2.show(truncate=False)
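For reference, here is a minimal self-contained sketch of how the pieces fit together, assuming the column names from the question and a toy some_func() that just checks the coordinates are within valid ranges (your real verification logic would go there instead):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# toy data standing in for the real df
df = spark.createDataFrame(
    [("2020-01-01 00:00:00", "e1", "s1", 52.5, 13.4, "none"),
     ("2020-01-01 00:01:00", "e2", "s1", 95.0, 200.0, "none")],
    ["datetime", "eventid", "sessionid", "lat", "lon", "filtertype"],
)

# placeholder verification logic: lat/lon must be in valid ranges
def some_func(lat, lon, eventid):
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

verified_udf = f.udf(some_func, BooleanType())

df2 = df.withColumn("verified", verified_udf(f.col("lat"), f.col("lon"), f.col("eventid")))
df2.show(truncate=False)

The UDF receives the plain Python values of each row's lat, lon and eventid, so inside some_func() you can work with them separately, without concatenating the columns first.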
Upvotes: 2