banerjs
banerjs

Reputation: 548

Selecting subset tuples of pairs of levels in MultiIndex

(Apologies if this is a repost. The closest answer I could find was this: multiindex selecting in pandas and it doesn't quite do what I want. Instead I'll use the data from that question as an example because my own data has a similar setup but would require a bit more explanation)

Here's the data (reposted):

                    0  1  2  3
first second third            
C     one    mean   3  4  2  7
             std    4  1  7  7
      two    mean   3  1  4  7
             std    5  6  7  0
      three  mean   7  0  2  5
             std    7  3  7  1
H     one    mean   2  4  3  3
             std    5  5  3  5
      two    mean   5  7  0  6
             std    0  1  0  2
      three  mean   5  2  5  1
             std    9  0  4  6
V     one    mean   3  7  3  9
             std    8  7  9  3
      two    mean   1  9  9  0
             std    1  1  5  1
      three  mean   3  1  0  6
             std    6  2  7  4

Based on a prior query, I've got a set of tuples of first and second index levels based on the values in column 0 that are less than equal to 3. For example:

# selects rows where column 0 has a 'mean' value <= 3.
ser = df.loc[(slice(None), slice(None), 'mean'), 0]
ser = ser[ser <= 3]

idx_tuples = [(val[0], val[1]) for val in ser.index]
# Has value: [(C, one), (C, two), (H, one), (V, one), (V, two), (V, three)]

Now I would like to select both the mean and std rows from the original frame (df) for index values that satisfy the above set of tuples. I could iterate through each tuple individually followed by a concatenate operation, but I was wondering if there is a more Pythonic / Pandas-ic(?) way to achieve what I want with intelligent slicing, etc. (particularly because the number of tuples in my dataset numbers in the thousands)?

Thanks!

PS: Note that my problem arises in the fact that specific tuples of the first 2 levels satisfy the selection condition, so I can't do a naive cross-product of values from get_level_values.

Upvotes: 1

Views: 1249

Answers (3)

hm8
hm8

Reputation: 1523

No idea about the efficiency of this approach, but I like its simplicity:

sel_df = df.reset_index('third').loc[idx_tuples].set_index('third', append=True)

                    0  1  2  3
first second third            
C     two    mean   0  7  6  1
             std    6  9  3  8
      three  mean   1  3  2  5
             std    9  4  4  1
H     one    mean   2  6  0  0
             std    9  4  9  6
      two    mean   0  1  8  6
             std    6  1  3  1
      three  mean   3  4  6  7
             std    8  4  8  8

Upvotes: 0

Ben.T
Ben.T

Reputation: 29635

you can use the fact that unstack and later stack will remove the nan and use where between to replace by nan the rows you don't want.

df_ = (df.unstack('third')
         .where(lambda x: x[(0,'mean')].le(3))
         .stack('third')
      )
print (df_) #for me as I don't have the same random values
                      0    1    2    3
first second third                    
C     two    mean   1.0  3.0  7.0  8.0
             std    9.0  0.0  0.0  9.0
H     one    mean   3.0  0.0  2.0  1.0
             std    7.0  9.0  2.0  7.0
      two    mean   0.0  2.0  1.0  0.0
             std    6.0  5.0  6.0  5.0

Upvotes: 2

ALollz
ALollz

Reputation: 59579

If you can't get away from that list of tuples, I'd suggest a merge. The merge will destroy the MultiIndex, so we need to remove it first, so that the 'third' level is brought along and then we reconstruct it in the end.

cols = ['first', 'second']
df = (df.reset_index()
        .merge(pd.DataFrame(idx_tuples, columns=cols), on=cols)
        .set_index(df.index.names))

                    0  1  2  3
first second third            
C     one    mean   3  4  2  7
             std    4  1  7  7
      two    mean   3  1  4  7
             std    5  6  7  0
H     one    mean   2  4  3  3
             std    5  5  3  5
V     one    mean   3  7  3  9
             std    8  7  9  3
      two    mean   1  9  9  0
             std    1  1  5  1
      three  mean   3  1  0  6
             std    6  2  7  4

Another alternative is to change how you create this selection. You can instead use transform to get a Boolean Mask across the ['first', 'second'] levels if the 'mean' row is <=3.

m = ((df[0].le(3) & (df.index.get_level_values('third') == 'mean'))
        .groupby(['first', 'second']).transform('any'))

df.loc[m]
# Same as above

Upvotes: 2

Related Questions