Reputation: 415
I'm new to PySpark and I'm trying to create a generic .where() call that can accept any number of conditions without my having to list each of them explicitly.
The hope is to drive this from some sort of data structure, generalising a call of the form:
from pyspark.sql.functions import col

col_1_value = "Value_1"
col_2_value = "Value_2"
col_3_value = "Value_3"

(
    df.where(
        (col("col_1") == col_1_value)
        & (col("col_2") == col_2_value)
        & (col("col_3") == col_3_value)
    )
)
Into something like:
(
    df.where(
        **where_conditions
    )
)
I suspect I will have to make use of **kwargs to allow this to happen, but I'm unsure if Spark already supports this sort of behaviour.
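Here, where_conditions would come from some data structure mapping column names to the values they should match, for example a plain dict like this (just to illustrate the kind of structure I mean):

where_conditions = {
    "col_1": "Value_1",
    "col_2": "Value_2",
    "col_3": "Value_3",
}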
Upvotes: 1
Views: 559
Reputation: 2407
You can put the filters in a list, then use functools.reduce to & them all together:
>>> df = spark.createDataFrame(
... [
... (1, 1, 1),
... (1, 1, 2),
... (1, 1, 3),
... (2, 1, 3),
... ],
... ["a", "b", "c"]
... )
>>> from pyspark.sql.functions import col
>>> filters = [
... col("a") == 1,
... col("c") >= 2,
... ]
>>> from functools import reduce
>>> df.where(reduce((lambda f1, f2: f1 & f2), filters)).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  2|
|  1|  1|  3|
+---+---+---+
>>> # you can also use the `operator` module instead of writing a lambda
>>> from operator import and_
>>> df.where(reduce(and_, filters)).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  2|
|  1|  1|  3|
+---+---+---+
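If you want to drive this from a data structure like the where_conditions in your question, you can build the filter list from a dict of column name to expected value and reduce it the same way. A minimal sketch, assuming plain equality checks (the where_conditions mapping here is just an example):
>>> # build one equality filter per (column name, expected value) pair
>>> where_conditions = {"a": 1, "c": 3}
>>> filters = [col(name) == value for name, value in where_conditions.items()]
>>> df.where(reduce(and_, filters)).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  3|
+---+---+---+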
Upvotes: 1