Reputation: 636
I have a pyspark dataframe which looks like below
df
num11 num21
10 10
20 30
5 25
I am filtering the above dataframe on all columns present, keeping only rows where every column value is greater than or equal to 10 [the number of columns can be more than two]
from pyspark.sql.functions import col
col_list = df.schema.names
df_fltered = df.where(col(c) >= 10 for c in col_list)
The desired output is:
num11 num21
10 10
20 30
How can we achieve filtering on multiple columns by iterating over the column list as above? [All efforts are appreciated.]
[The error I receive is: condition should be string or column]
Upvotes: 2
Views: 6724
Reputation: 2696
As an alternative, if you are not averse to some SQL-like snippets of code, the following should work (using `c` as the loop variable avoids shadowing the `col` function imported in the question, and the spaces around `AND` keep the generated expression readable):
df.where(" AND ".join(["(%s >= 10)" % c for c in col_list]))
Upvotes: 2
Reputation: 214927
You can use functools.reduce to combine the per-column conditions into a single "all columns" condition, for instance with reduce(lambda x, y: x & y, ...):
import pyspark.sql.functions as F
from functools import reduce
df.where(reduce(lambda x, y: x & y, (F.col(x) >= 10 for x in df.columns))).show()
+-----+-----+
|num11|num21|
+-----+-----+
| 10| 10|
| 20| 30|
+-----+-----+
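A small variation on the same idea, assuming the example dataframe from the question: operator.and_ can stand in for the lambda, which some readers find clearer.

```python
import operator
from functools import reduce
import pyspark.sql.functions as F

# Build one Column expression equivalent to (num11 >= 10) & (num21 >= 10) & ...
# by folding the per-column conditions together with the & operator.
df_filtered = df.where(reduce(operator.and_, (F.col(c) >= 10 for c in df.columns)))
df_filtered.show()
```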
Upvotes: 1