Reputation: 106
I want to know if there is a way to optimize a search that I am doing. I have a multi-index (3 levels) dataframe df, like this:
IndexID  IndexDateTime  IndexAttribute  ColumnA  ColumnB
1        2015-02-05     8               A        B
1        2015-02-05     7               C        D
1        2015-02-10     7               X        Y
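For reference, this example frame can be rebuilt with something like the following (a minimal sketch; the three key columns go into the MultiIndex):

import pandas as pd

df = pd.DataFrame({
    'IndexID': [1, 1, 1],
    'IndexDateTime': pd.to_datetime(['2015-02-05', '2015-02-05', '2015-02-10']),
    'IndexAttribute': [8, 7, 7],
    'ColumnA': ['A', 'C', 'X'],
    'ColumnB': ['B', 'D', 'Y'],
}).set_index(['IndexID', 'IndexDateTime', 'IndexAttribute'])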
My problem is that I want to know, for a given date (say 2015-02-10), whether there is data in ColumnA with the same IndexID and IndexAttribute a given number of days before (5 in this case), and if there is, get it and add it to a new column, like this:
IndexID  IndexDateTime  IndexAttribute  ColumnA  ColumnB  NewColumn
1        2015-02-05     8               A        B        -1
1        2015-02-05     7               C        D        -1
1        2015-02-10     7               X        Y        C
I want to perform this search on every row in my dataframe, which has 19 million rows. The way I am doing it is:
df['NewColumn'] = df.apply(lambda r: get_data(df, r.IndexID, r.IndexDateTime, r.IndexAttribute, 5), axis=1)
Where get_data is:
def get_data(df, IndexID, IndexDateTime, IndexAttribute, days_before):
    idx = pd.IndexSlice
    date = IndexDateTime - pd.to_timedelta(days_before, 'd')
    try:
        res = df.loc[idx[IndexID, date, IndexAttribute], 'ColumnA']
        return res
    except KeyError:
        return -1
This is extremely slow, taking over 2 hours. I was wondering if there is a faster way. Problems:
- IndexDateTime
- I cannot know in advance how many IndexAttributes there are. They are ints though, and they are in descending order.
- I can't do a shift because I don't know how many rows lie between any two dates.
Any ideas? Thank you!
Upvotes: 1
Views: 246
Reputation: 2598
Using numpy can be pretty fast. You just need to iterate through the columns of the dataframe as numpy arrays. Hope it helps:
import numpy as np
import pandas as pd

%time
def myfunc(df, days_before=5):
    # Fill the result column with -1s
    result = -np.ones_like(df.values[:, -1:])
    # Slice the first 3 columns and shift the dates
    # to get the index that we are looking for
    idx = np.array((df['IndexID'].values,
                    df['IndexDateTime'] - pd.to_timedelta(days_before, 'd'),
                    df['IndexAttribute'].values)).T
    # Look for days matching in the first 3 columns
    _idx_comp = df.values[:, :3][np.newaxis, :] == np.array(idx)[:, np.newaxis]
    # Get the index where there is a match
    # between the row of the dataframe and the desired searched rows
    idx_found = np.where(np.all(_idx_comp, axis=-1))
    # Assign the matched rows their required value
    result[idx_found[0]] = df['ColumnA'].values[idx_found[-1]]
    return result

df.assign(NewColumn=myfunc(df))
CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.96 µs
   IndexID  IndexDateTime  IndexAttribute  ColumnA  ColumnB  NewColumn
0  1        2015-02-05     8               A        B        -1
1  1        2015-02-05     7               C        D        -1
2  1        2015-02-10     7               X        Y        C
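Note that this sketch reads IndexID, IndexDateTime and IndexAttribute as ordinary columns; if they sit in the MultiIndex as in the question, you would presumably flatten the frame first, for example:

flat = df.reset_index()
flat.assign(NewColumn=myfunc(flat))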
Upvotes: 1
Reputation: 323226
This is an O(m·n) solution; however, it should be faster than your original one:
import numpy as np

l = []
for _, y in df.groupby(level=[0, 2], sort=False):
    # dates of this (IndexID, IndexAttribute) group
    s = y.index.get_level_values(level=1).values
    # boolean matrix: entry [i, j] is True when row j is exactly 5 days before row i;
    # dotting it with ColumnA picks the matching value ('' when there is no match)
    l.append(((s - s[:, None]) / np.timedelta64(1, 'D') == -5).dot(y.ColumnA.values))
df['NewCOL'] = np.concatenate(l)
df
Out[48]:
                                        ColumnA  ColumnB  NewCOL
IndexID  IndexDateTime  IndexAttribute
1        2015-02-05     8               A        B
                        7               C        D
         2015-02-10     7               X        Y        C
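To see why the dot product works, here is a small hypothetical illustration (not part of the answer): within one (IndexID, IndexAttribute) group, the comparison builds a boolean matrix whose entry [i, j] is True when row j lies exactly 5 days before row i, and dotting it with ColumnA concatenates the matching value (or leaves an empty string when nothing matches):

import numpy as np
import pandas as pd

s = pd.to_datetime(['2015-02-05', '2015-02-10']).values   # dates of one group
mask = (s - s[:, None]) / np.timedelta64(1, 'D') == -5    # mask[i, j]: row j is 5 days before row i
mask.dot(np.array(['C', 'X'], dtype=object))               # expected: array(['', 'C'], dtype=object)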
Upvotes: 1