Reputation: 23
I am working with a large dataframe (~10M rows) that contains dates and textual data, and I have a list of values for which I need to make some calculations, one value at a time.
For each value, I need to filter/subset my dataframe based on 4 conditions, then make my calculations and move on to the next value. Currently, ~80% of the time is spent on the filter block, which makes the total processing time extremely long (a few hours).
What I currently have is this:
for val in unique_list:  # iterate on values in a list
    if val is not None or val != kip:  # as long as it's an acceptable value
        for year_num in range(1, 6):  # split by years
            # filter and make intermediate df based on per value & per year calculation
            cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
            cond_2 = df[f'{kip}'].notna()
            cond_3 = df['Date'].dt.year < 2015 + year_num
            cond_4 = df['Date'].dt.year >= 2015 + year_num - 1
            temp_df = df[cond_1 & cond_2 & cond_3 & cond_4].copy()
Condition 1 takes around 45% of the time, while conditions 3 & 4 take ~22% each.
Is there a better way to implement this? Is there a way to avoid .dt and .str and use something faster?
Here is the profiler output on just 3 values (out of thousands):
Total time: 16.338 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def get_word_counts(df, kip, unique_list):
2 # to hold predictors
3 1 1929.0 1929.0 0.0 predictors_df = pd.DataFrame(index=[f'{kip}'])
4 1 2.0 2.0 0.0 n = 0
5
6 3 7.0 2.3 0.0 for val in unique_list: # iterate on values in a list
7 3 3.0 1.0 0.0 if val is not None or val != kip: # as long as its an acceptable value
8 18 39.0 2.2 0.0 for year_num in range(1, 6): # split by years
9
10 # filter and make intermediate df based on per value & per year calculation
11 15 7358029.0 490535.3 45.0 cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
12 15 992250.0 66150.0 6.1 cond_2 = df[f'{kip}'].notna()
13 15 3723789.0 248252.6 22.8 cond_3 = df['Date'].dt.year < 2015 + year_num
14 15 3733879.0 248925.3 22.9 cond_4 = df['Date'].dt.year >= 2015 + year_num -1
The data mainly looks like this (I use only relevant columns when doing the calculations):
Date Ingredient
20 2016-07-20 Magnesium
21 2020-02-18 <NA>
22 2016-01-28 Apple;Cherry;Lemon;Olives General;Peanut Butter
23 2015-07-23 <NA>
24 2018-01-11 <NA>
25 2019-05-30 Egg;Soy;Unspecified Egg;Whole Eggs
26 2020-02-20 Chocolate;Peanut;Peanut Butter
27 2016-01-21 Raisin
28 2020-05-11 <NA>
29 2020-05-15 Chocolate
30 2019-08-16 <NA>
31 2020-03-28 Chocolate
32 2015-11-04 <NA>
33 2016-08-21 <NA>
34 2015-08-25 Almond;Coconut
35 2016-12-18 Almond
36 2016-01-18 <NA>
37 2015-11-18 Peanut;Peanut Butter
38 2019-06-04 <NA>
39 2016-04-08 <NA>
Upvotes: 2
Views: 4157
Reputation: 26271
So, it looks like you really just want to split the 'Date' column by year and do something with each group. Also, for a large df, it is usually faster to filter what you can once beforehand to get a smaller df (in your example, one year's worth of data), and then do all your looping/extractions on that smaller df.
Without knowing much more about the data itself (C-contiguous? F-contiguous? Date-sorted?), it's hard to be sure, but I would guess that the following may prove to be faster (and it also feels more natural IMHO):
# 1. do everything you can outside the loop
# 1.a prep your patterns
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
# you meant 'and', not 'or', right?

# 1.b filter and sort the data (why sort? better mem locality)
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')

# 2. do one groupby by year
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year  # optional, if you need it
    # 2.b reuse each group as much as possible
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
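If you'd also like to get rid of .dt inside the loops entirely (per your question), a small variation, just a sketch on my part, is to compute the year a single time on the pre-filtered frame and group on that plain column; for calendar years it behaves the same as the Grouper above:
# sketch: .dt.year runs exactly once, on the already-small frame
z = z.assign(year=z['Date'].dt.year)

for year, dfy in z.groupby('year'):
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
Note that the range filter above (df['Date'] >= '2015') already avoids .dt as well: pandas parses the string into a Timestamp before comparing.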
Example (guessing some data, really):
import re
from collections import defaultdict

import numpy as np
import pandas as pd

n = 10_000_000
str_examples = ['hello', 'world', 'hi', 'roger', 'kilo', 'zulu', None]
df = pd.DataFrame({
    'Date': [pd.Timestamp('2010-01-01') + k * pd.Timedelta('1 day')
             for k in np.random.randint(0, 3650, size=n)],
    'x': np.random.randint(0, 1200, size=n),
    'foo': np.random.choice(str_examples, size=n),
    'bar': np.random.choice(str_examples, size=n),
})
unique_list = ['rld', 'oger']
kip = 'foo'
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
%%time
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# CPU times: user 1.67 s, sys: 124 ms, total: 1.79 s
%%time
out = defaultdict(dict)
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        out[year].update({escval: temp_df})
# CPU times: user 2.64 s, sys: 0 ns, total: 2.64 s
Quick sniff test:
>>> out.keys()
dict_keys([2015, 2016, 2017, 2018, 2019])
>>> out[2015].keys()
dict_keys(['rld', 'oger'])
>>> out[2015]['oger'].shape
(142572, 4)
>>> out[2015]['oger'].tail()
Date x foo bar
3354886 2015-12-31 409 roger hello
8792739 2015-12-31 474 roger zulu
3944171 2015-12-31 310 roger hi
7578485 2015-12-31 125 roger None
2963220 2015-12-31 809 roger hi
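One further idea, which I have not benchmarked and which rests on an extra assumption: your Ingredient column looks like ;-separated whole tokens, so if every value in unique_list is a complete token (str.contains also matches substrings, so the semantics are not identical), you could split/explode the pre-filtered frame once and replace all the per-value regex scans with a single groupby:
# sketch, assuming each search value is a whole ';'-separated token
# (unlike str.contains, this will NOT match substrings within a token)
wanted = {str(v) for v in unique_list if v is not None and v != kip}

tok = (
    z[['Date', kip]]
    .assign(token=lambda d: d[kip].str.split(';'))
    .explode('token')                     # one row per (row, token) pair
)
tok = tok[tok['token'].isin(wanted)]      # keep only the values of interest

for (year, val), grp in tok.groupby([tok['Date'].dt.year, 'token']):
    temp_df = z.loc[grp.index.unique()]   # original rows for this value/year
    # do something with temp_df ...
This trades the repeated .str.contains passes (the 45% in your profile) for one split/explode pass, at the cost of exact-token rather than substring matching.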
Upvotes: 2