Reputation: 179
I need to filter a dataframe with a dict, constructed so that the key is the column name and the value is the value I want to filter on:
filter = {'column_1': 'Y', 'column_2': 'N'}
I understand how to do this with a pandas DataFrame, using the function below:
import pandas as pd

def filter(df, filters):
    # keep rows where every column named in `filters` equals its target value
    filtered_df = df.loc[(df[list(filters)] == pd.Series(filters)).all(axis=1)]
    return filtered_df
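For illustration, here is how that works on a tiny made-up DataFrame (the sample data is an assumption, just to show the .loc pattern in action):

import pandas as pd

# hypothetical sample data, only to demonstrate the dict-based filter
df = pd.DataFrame({'column_1': ['Y', 'Y'], 'column_2': ['N', 'Y']})
filters = {'column_1': 'Y', 'column_2': 'N'}

print(df.loc[(df[list(filters)] == pd.Series(filters)).all(axis=1)])
#   column_1 column_2
# 0        Y        N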
However, .loc is not a method available outside of pandas, and I haven't quite grasped what the Spark equivalent would be. I know Spark has a locate function, but its syntax seems to be quite different.
Many Thanks
Upvotes: 1
Views: 3038
Reputation: 23
I have the scenario below:
{'date': ['2021-12-18', '2020-02-05']}
I want a filter condition like: date = '2021-12-18' or date = '2020-02-05'
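A sketch of one way to build that, borrowing the two approaches from the answer below; the dict name and the idea of mapping each key to a list of accepted values are assumptions on my part:

from functools import reduce
from pyspark.sql import functions as F

# hypothetical dict where each key maps to a list of accepted values
filters = {'date': ['2021-12-18', '2020-02-05']}

# Option 1: an SQL string using IN, e.g. "date in ('2021-12-18', '2020-02-05')"
filter_string = ' and '.join(
    f"{k} in ({', '.join(repr(v) for v in vals)})"
    for k, vals in filters.items()
)

# Option 2: a Column using isin(), ANDed across keys
filter_col = reduce(lambda x, y: x & y,
                    [F.col(k).isin(vals) for k, vals in filters.items()])

# filtered_df = df.filter(filter_string)  # or df.filter(filter_col)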
Upvotes: 1
Reputation: 42392
Here's an example. You can construct an SQL expression string to filter the Spark dataframe.
filter = {'column_1': 'Y', 'column_2': 'N'}
df = spark.createDataFrame([['Y', 'N'], ['Y', 'Y']], ['column_1', 'column_2'])
df.show()
+--------+--------+
|column_1|column_2|
+--------+--------+
| Y| N|
| Y| Y|
+--------+--------+
filter_string = ' and '.join([f"{k} = '{v}'" for (k, v) in filter.items()])
print(filter_string)
# column_1 = 'Y' and column_2 = 'N'
filtered_df = df.filter(filter_string)
filtered_df.show()
+--------+--------+
|column_1|column_2|
+--------+--------+
| Y| N|
+--------+--------+
Or you can construct the filter condition as a Spark Column:
from functools import reduce
from pyspark.sql import functions as F

# combine the per-column equality conditions with & (logical AND)
filter_col = reduce(lambda x, y: x & y, [F.col(k) == v for (k, v) in filter.items()])
filtered_df = df.filter(filter_col)
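Calling filtered_df.show() here gives the same single-row result as the string version above. As a purely stylistic variation (my own sketch, not part of the original answer), the lambda can be replaced with operator.and_:

import operator
from functools import reduce
from pyspark.sql import functions as F

# equivalent to the lambda above: AND all per-column conditions together
filter_col = reduce(operator.and_, [F.col(k) == v for (k, v) in filter.items()])
df.filter(filter_col).show()
# +--------+--------+
# |column_1|column_2|
# +--------+--------+
# |       Y|       N|
# +--------+--------+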
Upvotes: 3