Reputation: 9019
I had originally asked this question here, and I believe it was incorrectly marked as a duplicate. I will do my best here to clarify my question and how I believe it is unique.
Given the following example MultiIndex dataframe:
import pandas as pd
import numpy as np
first = ['A', 'B', 'C']
second = ['a', 'b', 'c', 'd']
third = ['1', '2', '3']
indices = [first, second, third]
index = pd.MultiIndex.from_product(indices, names=['first', 'second', 'third'])
df = pd.DataFrame(np.random.randint(10, size=(len(first)*len(second)*len(third), 4)),
                  index=index, columns=['Val1', 'Val2', 'Val3', 'Val4'])
Goal: I would like to retain a given level=1 index (such as 'a') if the value of column 'Val2' corresponding to index value '1' in level=2 is greater than 5 for that level=1 index. Conversely, if this criterion is not met (i.e. column 'Val2' is less than or equal to 5 for index '1' in level=2), then the corresponding level=1 index would be removed from the dataframe. If no level=1 index meets the criterion for a given level=0 index, then that level=0 index would be removed as well. My previous post contains my expected output (I could add it here, but I wanted to keep this post as succinct as possible).
Here is my current solution, the performance of which I'm sure can be improved:
grouped = df.groupby(level=0)
# Within each level-0 group, keep a level-1 group only if its 'Val2' at third == '1' exceeds 5
output = pd.concat([group.groupby(level=1).filter(
    lambda x: (x.loc[pd.IndexSlice[:, :, '1'], 'Val2'] > 5).any()) for _, group in grouped])
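As an aside, since the predicate only depends on rows within a single (first, second) pair, the two nested groupbys can presumably be collapsed into a single groupby on both levels (an untested sketch):
# Sketch: group on levels 0 and 1 together, avoiding the Python-level loop over level=0 groups
output_alt = df.groupby(level=[0, 1]).filter(
    lambda x: (x.loc[pd.IndexSlice[:, :, '1'], 'Val2'] > 5).any())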
The groupby/filter solution does produce my desired output, but for a dataframe with hundreds of thousands of rows the performance is rather poor. Is there something obvious I am missing that would better utilize pandas' under-the-hood optimizations?
Upvotes: 1
Views: 292
Reputation: 1431
I got the same result as your example solution by doing the following:
df.loc[df.xs('1', level=2)['Val2'] > 5]
Comparing time performance, this is ~15x faster (on my machine your example takes 36 ms while this takes 2 ms).
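If you want to verify the equivalence yourself, here is a quick check (this assumes output from your groupby/filter solution is still in scope):
# Boolean mask indexed by (first, second); pandas aligns it against the
# frame's first two index levels when passed to .loc
mask = df.xs('1', level=2)['Val2'] > 5
assert df.loc[mask].equals(output)  # same rows as the groupby/filter approach
The speedup comes from replacing the Python-level loop of groupby/filter calls with a single cross-section and one vectorized boolean selection.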
Upvotes: 1