Reputation: 1247
I have a large DataFrame with about 1 billion rows and about 15 columns.
+--------+-------+-----------+----+
| country| city  | date      | ...|
+--------+-------+-----------+----+
| France | Paris | 2018-07-01| ...|
| Spain  | Madrid| 2017-06-01| ...|
+--------+-------+-----------+----+
And I have a smaller DataFrame - about 50 rows - that contains the filter dates for each (country, city) combination.
+--------+-------+-------------+
| country| city  | filter_date |
+--------+-------+-------------+
| France | Paris | 2018-07-01  |
| Spain  | Madrid| 2017-06-01  |
+--------+-------+-------------+
I would like to filter the large DataFrame by date, using the filter_date stored in the small DataFrame for each (country, city) combination - for example, remove any row that contains (France, Paris) and has a date before 2018-07-01, etc.
The solution I originally had in mind was just to do a left join and then filter, such as:

from pyspark.sql import functions as f

df = df_large.join(df_small, on=['country', 'city'], how='left').filter(f.col('date') >= f.col('filter_date'))
However, this solution is not ideal: the left join is very expensive and my DataFrame is too big, so the code takes a very long time to run when an action is triggered after this operation.
Upvotes: 0
Views: 179
Reputation: 6323
Try a left semi join and broadcast the smaller df. Also combine all the filters using and, as below:
import org.apache.spark.sql.functions.broadcast

df_large.join(broadcast(df_small),
  df_large("country") === df_small("country") && df_large("city") === df_small("city") &&
    df_large("date") >= df_small("filter_date"), "leftsemi")
Upvotes: 1