Reputation: 23
I am working with a large dataframe (~10M rows) that contains dates and textual data, and I have a list of values for which I need to make some calculations, one value at a time.
For each value, I need to filter/subset my dataframe based on 4 conditions, then make my calculations and move on to the next value. Currently, ~80% of the time is spent on the filter block, which makes the total processing time extremely long (a few hours).
What I currently have is this:
for val in unique_list:  # iterate on values in a list
    if val is not None or val != kip:  # as long as it's an acceptable value
        for year_num in range(1, 6):  # split by years
            # filter and make intermediate df based on per value & per year calculation
            cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
            cond_2 = df[f'{kip}'].notna()
            cond_3 = df['Date'].dt.year < 2015 + year_num
            cond_4 = df['Date'].dt.year >= 2015 + year_num - 1
            temp_df = df[cond_1 & cond_2 & cond_3 & cond_4].copy()
Condition 1 takes around 45% of the time, while conditions 3 & 4 take ~22% each.
Is there a better way to implement this? Is there a way to avoid .dt and .str and use something faster?
Here is the profiler output on just 3 values (out of thousands):
Total time: 16.338 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def get_word_counts(df, kip, unique_list):
2 # to hold predictors
3 1 1929.0 1929.0 0.0 predictors_df = pd.DataFrame(index=[f'{kip}'])
4 1 2.0 2.0 0.0 n = 0
5
6 3 7.0 2.3 0.0 for val in unique_list: # iterate on values in a list
7 3 3.0 1.0 0.0 if val is not None or val != kip: # as long as its an acceptable value
8 18 39.0 2.2 0.0 for year_num in range(1, 6): # split by years
9
10 # filter and make intermediate df based on per value & per year calculation
11 15 7358029.0 490535.3 45.0 cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
12 15 992250.0 66150.0 6.1 cond_2 = df[f'{kip}'].notna()
13 15 3723789.0 248252.6 22.8 cond_3 = df['Date'].dt.year < 2015 + year_num
14 15 3733879.0 248925.3 22.9 cond_4 = df['Date'].dt.year >= 2015 + year_num -1
The data mainly looks like this (I use only relevant columns when doing the calculations):
Date Ingredient
20 2016-07-20 Magnesium
21 2020-02-18 <NA>
22 2016-01-28 Apple;Cherry;Lemon;Olives General;Peanut Butter
23 2015-07-23 <NA>
24 2018-01-11 <NA>
25 2019-05-30 Egg;Soy;Unspecified Egg;Whole Eggs
26 2020-02-20 Chocolate;Peanut;Peanut Butter
27 2016-01-21 Raisin
28 2020-05-11 <NA>
29 2020-05-15 Chocolate
30 2019-08-16 <NA>
31 2020-03-28 Chocolate
32 2015-11-04 <NA>
33 2016-08-21 <NA>
34 2015-08-25 Almond;Coconut
35 2016-12-18 Almond
36 2016-01-18 <NA>
37 2015-11-18 Peanut;Peanut Butter
38 2019-06-04 <NA>
39 2016-04-08 <NA>
Upvotes: 2
Views: 4157
Reputation: 26271
So, it looks like you really just want to split the 'Date' column by year and do something with each group. Also, for a large df, it is usually faster to filter what you can once beforehand to get a smaller df (in your example, one year's worth of data), and then do all your looping/extractions on that smaller df.
Without knowing much more about the data itself (C-contiguous? F-contiguous? Date-sorted?), it's hard to be sure, but I would guess that the following may prove to be faster (and it also feels more natural IMHO):
# 1. do everything you can outside the loop
# 1.a prep your patterns
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
# you meant 'and', not 'or', right?

# 1.b filter and sort the data (why sort? better mem locality)
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')

# 2. do one groupby by year
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year  # optional, if you need it
    # 2.b reuse each group as much as possible
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
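If you'd also like to get rid of .dt inside the loops entirely (per your question), a small variation, just a sketch on my part, is to compute the year a single time on the pre-filtered frame and group on that plain column; for calendar years it behaves the same as the Grouper above:
# sketch: .dt.year runs exactly once, on the already-small frame
z = z.assign(year=z['Date'].dt.year)

for year, dfy in z.groupby('year'):
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
Note that the range filter above (df['Date'] >= '2015') already avoids .dt as well: pandas parses the string into a Timestamp before comparing.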
Example (guessing some data, really):
import re
from collections import defaultdict

import numpy as np
import pandas as pd

n = 10_000_000
str_examples = ['hello', 'world', 'hi', 'roger', 'kilo', 'zulu', None]
df = pd.DataFrame({
    'Date': [pd.Timestamp('2010-01-01') + k * pd.Timedelta('1 day')
             for k in np.random.randint(0, 3650, size=n)],
    'x': np.random.randint(0, 1200, size=n),
    'foo': np.random.choice(str_examples, size=n),
    'bar': np.random.choice(str_examples, size=n),
})
unique_list = ['rld', 'oger']
kip = 'foo'
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
%%time
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# CPU times: user 1.67 s, sys: 124 ms, total: 1.79 s
%%time
out = defaultdict(dict)
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        out[year].update({escval: temp_df})
# CPU times: user 2.64 s, sys: 0 ns, total: 2.64 s
Quick sniff test:
>>> out.keys()
dict_keys([2015, 2016, 2017, 2018, 2019])
>>> out[2015].keys()
dict_keys(['rld', 'oger'])
>>> out[2015]['oger'].shape
(142572, 4)
>>> out[2015]['oger'].tail()
Date x foo bar
3354886 2015-12-31 409 roger hello
8792739 2015-12-31 474 roger zulu
3944171 2015-12-31 310 roger hi
7578485 2015-12-31 125 roger None
2963220 2015-12-31 809 roger hi
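One further idea, which I have not benchmarked and which rests on an extra assumption: your Ingredient column looks like ;-separated whole tokens, so if every value in unique_list is a complete token (str.contains also matches substrings, so the semantics are not identical), you could split/explode the pre-filtered frame once and replace all the per-value regex scans with a single groupby:
# sketch, assuming each search value is a whole ';'-separated token
# (unlike str.contains, this will NOT match substrings within a token)
wanted = {str(v) for v in unique_list if v is not None and v != kip}

tok = (
    z[['Date', kip]]
    .assign(token=lambda d: d[kip].str.split(';'))
    .explode('token')                     # one row per (row, token) pair
)
tok = tok[tok['token'].isin(wanted)]      # keep only the values of interest

for (year, val), grp in tok.groupby([tok['Date'].dt.year, 'token']):
    temp_df = z.loc[grp.index.unique()]   # original rows for this value/year
    # do something with temp_df ...
This trades the repeated .str.contains passes (the 45% in your profile) for one split/explode pass, at the cost of exact-token rather than substring matching.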
Upvotes: 2