Bob

Reputation: 415

Create a PySpark .where() statement with any number of conditions

I'm new to PySpark and I'm trying to create a generic .where() function that can accept any number of conditions without me having to list them explicitly in the function.

The hope is to drive this from some sort of data structure, generalising a function of the form:

col_1_value = "Value_1"
col_2_value = "Value_2"
col_3_value = "Value_3"
(
    df.where(
        (col("col_1") == col_1_value)
        & (col("col_2") == col_2_value)
        & (col("col_3") == col_3_value)
    )
)

Into something like:

(
    df.where(
       **where_conditions
    )
)

I suspect I will have to make use of **kwargs to allow this to happen, but I'm unsure if Spark already supports this sort of behaviour.

Upvotes: 1

Views: 559

Answers (1)

Czaporka

Reputation: 2407

You can put the filters in a list, then use functools.reduce to combine them all with &:

>>> df = spark.createDataFrame(
...     [
...         (1, 1, 1),
...         (1, 1, 2),
...         (1, 1, 3),
...         (2, 1, 3),
...     ],
...     ["a", "b", "c"]
... )
>>> from pyspark.sql.functions import col
>>> filters = [
...     col("a") == 1,
...     col("c") >= 2,
... ]
>>> from functools import reduce
>>> df.where(reduce((lambda f1, f2: f1 & f2), filters)).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  2|
|  1|  1|  3|
+---+---+---+
>>> # you can also use the `operator` module instead of writing a lambda
>>> from operator import and_
>>> df.where(reduce(and_, filters)).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  2|
|  1|  1|  3|
+---+---+---+

Upvotes: 1
