Delosari
Delosari

Reputation: 693

Slincing a pandas MultiIndex dataframe by one index where two row index value exist

I wonder if anyone could please offer some advice:

I have a data set with the following structure:

import pandas as pd

# Create individual pandas DataFrame.
df1 = pd.DataFrame({'Col1': [1, 2, 3, 4], 'Col2': [99, 98, 95, 90]}, index=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame({'Col1': [1, 2], 'Col2': [99, 98]}, index=['A', 'B'])
df3 = pd.DataFrame({'Col1': [3, 4], 'Col2': [95, 90]}, index=['C', 'D'])
df4 = pd.DataFrame({'Col1': [3, 4], 'Col2': [95, 90]}, index=['B', 'C'])

# Combine into one multi-index dataframe
df_dict = dict(obj1=df1, obj2=df2, obj3=df3, obj4=df4)

# Assign multi-index labels
mDF = pd.concat(list(df_dict.values()), keys=list(df_dict.keys()))
mDF.rename_axis(index=["ID", "property"], inplace=True)
print(mDF, '\n')

These multi-index dataframes have different number of "property" rows:

               Col1  Col2
ID   property            
obj1 A            1    99
     B            2    98
     C            3    95
     D            4    90
obj2 A            1    99
     B            2    98
obj3 C            3    95
     D            4    90
obj4 B            3    95
     C            4    90 

For example, I would like to calculate the sum of Col1 values for property A and B or all "IDs". However, this is only possible for those "ID" which have both properties tabulated.

I have tried to use the isin and query attributes:

idcs_isin = mDF.index.get_level_values('property').isin(['A', 'B'])
idcs_query = mDF.query('property in ["A","B"]')
print(f'isin:\n{mDF.loc[idcs_isin]}\n')
print(f'Query:\n{idcs_query}')

However, this returns any "ID" with either of the properties:

               Col1  Col2
ID   property            
obj1 A            1    99
     B            2    98
obj2 A            1    99
     B            2    98
obj4 B            3    95

Query:
               Col1  Col2
ID   property            
obj1 A            1    99
     B            2    98
obj2 A            1    99
     B            2    98
obj4 B            3    95

Which function should I use to recover the IDs "obj1" and "obj2" the only ones which have both the A and B properties?

Upvotes: 1

Views: 72

Answers (2)

sammywemmy
sammywemmy

Reputation: 28669

One option is to build a boolean array, which captures rows where both A and B exist per ID, and then use that array to index mDF:

bools = mDF.index.get_level_values('property').isin(['A','B'])
grouper = mDF.index.get_level_values('ID')
# there should be a minimum of two (`A`, `B`)
bools = pd.Series(bools).groupby(grouper).transform('sum').ge(2).array
mDF.loc[bools]
               Col1  Col2
ID   property            
obj1 A            1    99
     B            2    98
     C            3    95
     D            4    90
obj2 A            1    99
     B            2    98

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150745

Query works with named index:

mDF.query('property in ["A","B"]')

Output:

               Col1  Col2
ID   property            
obj1 A            1    99
     B            2    98
obj2 A            1    99
     B            2    98

Formatting the query string can be tricky sometimes. If you don't feel comfortable with it, you can use Ben.T's suggestion with .index.get_level_values.

Upvotes: 1

Related Questions