Reputation: 5933
I have a script that worked fine as long as I did not upgraded Pandas (0.24 --> 1.1.2). I have to select a list of dates from a dataframe, which has a multiindex that includes dates and products and the columns are the metrics.
So far i have done this and it worked liked a charm: df.loc[selected_dates]
Where selected_dates
is the list of dates. Some dates may be missing, and it's okay, because nan values will be dropped.
Now I have to sacrifice a goat figuratively to achieve the same with the method suggested by the guide (using reindexing) and it is still prone to errors.
How can I simply get these rows again with the new API?
Here's a sample data frame:
import pandas as pd
from random import randint
selected_dates = ['2020-09-02', '2020-09-05', '2020-09-10']
idx = pd.MultiIndex.from_product([pd.date_range('2020-09-01', '2020-09-07'), ['prod1', 'prod2', 'prod3']])
metric1 = [i for i in range(len(idx))]
metric2 = [10*i for i in range(len(idx))]
df = pd.DataFrame(dict(metric1=metric1, metric2=metric2), index=idx)
df.loc[selected_dates] # KeyError
The df
data frame will look like this:
metric1 metric2
2020-09-01 prod1 0 0
prod2 1 10
prod3 2 20
2020-09-02 prod1 3 30
prod2 4 40
prod3 5 50
...
Please think in big! These frame are quite large, and selection is done many times. I have already looked at this SO answer, but it's rather relevant for single indices, not multiindex.
Any hints are appreciated!
Upvotes: 1
Views: 235
Reputation: 862681
Solution with DataFrame.loc
use converted values to datetimes for match, then use IndexSlice
with first :
for match all values in second level of MultiIndex
and second :
for match all columns:
idx = pd.IndexSlice
selected_dates = pd.to_datetime(selected_dates)
df = df.loc[idx[selected_dates,:], :]
print (df)
metric1 metric2
2020-09-02 prod1 3 30
prod2 4 40
prod3 5 50
2020-09-05 prod1 12 120
prod2 13 130
prod3 14 140
Or use Index.get_level_values
with Index.isin
in boolean indexing
:
selected_dates = pd.to_datetime(selected_dates)
df = df[df.index.get_level_values(0).isin(selected_dates)]
print (df)
metric1 metric2
2020-09-02 prod1 3 30
prod2 4 40
prod3 5 50
2020-09-05 prod1 12 120
prod2 13 130
prod3 14 140
Upvotes: 1