Liky

Reputation: 1247

Filter large DataFrame conditioned by information from small DataFrame

I have a large DataFrame with about 1 billion rows and about 15 columns.

+--------+-------+-----------+----+
| country|  city | date      | ...|
+--------+-------+-----------+----+
|  France|  Paris| 2018-07-01| ...|
|   Spain| Madrid| 2017-06-01| ...|
+--------+-------+-----------+----+

I also have a smaller DataFrame (about 50 rows) containing the date to filter by for each (country, city) combination.

+--------+-------+-------------+
| country|  city | filter_date |
+--------+-------+-------------+
|  France|  Paris| 2018-07-01  |
|   Spain| Madrid| 2017-06-01  |
+--------+-------+-------------+

I would like to filter the large DataFrame by date, using the filter_date stored in the small DataFrame for each combination - for example, remove any row that contains (France, Paris) and has a date before 2018-07-01, etc.

The solution I originally had in mind was simply to do a left join and then filter, such as:

import pyspark.sql.functions as f
df = df_large.join(df_small, on=['country', 'city'], how='left').filter(f.col('date') >= f.col('filter_date'))

However, this solution is not ideal: the left join is very expensive because my DataFrame is so big, and any action taken after this operation takes a very long time to run.

Upvotes: 0

Views: 179

Answers (1)

Som

Reputation: 6323

Try a left semi join and broadcast the smaller DataFrame. Also combine all the filter conditions into the join condition itself, as below -

import org.apache.spark.sql.functions.broadcast

df_large.join(broadcast(df_small), df_large("country") === df_small("country") &&
  df_large("city") === df_small("city") && df_large("date") >= df_small("filter_date"), "leftsemi")

Upvotes: 1
