Vikash Balasubramanian

Reputation: 3233

Pandas multi index partial selection using list of tuples

Consider the following DataFrame:

import pandas as pd
import numpy as np

arr = np.random.random((2, 4))
mdf = pd.DataFrame({'cid': ['c1', 'c2']})
pdf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': ['p1', 'p2', 'p1', 'p2']})
index = pd.MultiIndex.from_frame(mdf.join(pdf, how='cross'))
df = pd.DataFrame({'score': arr.flatten()}, index=index)

df is

                             score
cid     doc_id  passage_id  
c1      d1       p1          0.708722
                 p2          0.975350
        d2       p1          0.326029
                 p2          0.979832
c2      d1       p1          0.147153
                 p2          0.381807
        d2       p1          0.525054
                 p2          0.245478

Now, if I try to index with a list of tuples that cover only the first two levels:

df.loc[[('c1', 'd1'), ('c2', 'd2')]]

I get the following error:

ValueError: operands could not be broadcast together with shapes (2,2) (3,) (2,2)

Why does this error happen?

I expected the result to be:

                             score
cid     doc_id  passage_id  
c1      d1       p1          0.708722
                 p2          0.975350
c2      d2       p1          0.525054
                 p2          0.245478

Upvotes: 1

Views: 725

Answers (3)

Steele Farnsworth

Reputation: 893

To add to the solutions that have been provided:

df.reset_index(level=2).loc[[('c1', 'd1'), ('c2', 'd2')]].set_index('passage_id', append=True)

I wish I could think of a more elegant solution. Here's the breakdown of what this is doing:

  • .reset_index(level=2) moves the third index level (passage_id) out of the index and into the "body" of the DataFrame, as a regular column.
  • Now that only two index levels are left, .loc[[('c1', 'd1'), ('c2', 'd2')]] gets the rows that you wanted.
  • .set_index('passage_id', append=True) moves the passage_id column back into the index.
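
The three steps can also be run one at a time, which makes each intermediate frame easy to inspect. This is a minimal sketch, assuming a rebuilt copy of the question's frame with fixed scores 0.0 through 7.0 in place of the random ones; the names keys, step1, step2 and result are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: same index layout as in the question, but with deterministic
# scores (0.0 .. 7.0) instead of random values so the selection is checkable.
index = pd.MultiIndex.from_product(
    [['c1', 'c2'], ['d1', 'd2'], ['p1', 'p2']],
    names=['cid', 'doc_id', 'passage_id'],
)
df = pd.DataFrame({'score': np.arange(8, dtype=float)}, index=index)

keys = [('c1', 'd1'), ('c2', 'd2')]

# Step 1: move passage_id out of the index, leaving a two-level index.
step1 = df.reset_index(level='passage_id')

# Step 2: the tuples now match the index depth, so .loc accepts them.
step2 = step1.loc[keys]

# Step 3: put passage_id back as the innermost index level.
result = step2.set_index('passage_id', append=True)

print(result)
```

Using the level name ('passage_id') instead of the position (2) is a small design choice that keeps the chain working if the level order changes.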

Upvotes: 0

Corralien

Reputation: 120409

You can use get_locs:

loc = df.index.get_locs
idx = np.union1d(loc(('c1', 'd1')), loc(('c2', 'd2')))
subdf = df.iloc[idx]

Output:

>>> subdf
                          score
cid doc_id passage_id          
c1  d1     p1          0.055452
           p2          0.758224
c2  d2     p1          0.773690
           p2          0.519005

>>> idx
array([0, 1, 6, 7])
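
For more than two partial keys, the same idea can be written as a loop. This is a minimal sketch, assuming a rebuilt copy of the question's frame with fixed scores 0.0 through 7.0 in place of the random ones; keys is an illustrative name, and np.unique over the concatenated positions stands in for the pairwise np.union1d above.

```python
import numpy as np
import pandas as pd

# Assumption: same index layout as in the question, with deterministic scores.
index = pd.MultiIndex.from_product(
    [['c1', 'c2'], ['d1', 'd2'], ['p1', 'p2']],
    names=['cid', 'doc_id', 'passage_id'],
)
df = pd.DataFrame({'score': np.arange(8, dtype=float)}, index=index)

keys = [('c1', 'd1'), ('c2', 'd2')]

# get_locs returns the integer positions matching each partial key;
# np.unique merges them into one sorted, duplicate-free array.
idx = np.unique(np.concatenate([df.index.get_locs(key) for key in keys]))
subdf = df.iloc[idx]

print(idx)
print(subdf)
```

Because the positions are sorted, the rows come back in the frame's own order rather than in the order of keys.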

Upvotes: 1

BENY

Reputation: 323226

This is a bit of overthinking, but since we need to keep the MultiIndex in the result:

inputtuple = pd.DataFrame([('c1', 'd1'), ('c2', 'd2')], columns=['cid', 'doc_id'])
out = df.reset_index().merge(inputtuple).set_index(df.index.names)

>>> out
                          score
cid doc_id passage_id          
c1  d1     p1          0.428390
           p2          0.931326
c2  d2     p1          0.160805
           p2          0.476747
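
The merge relies on pandas' defaults: an inner join on the columns the two frames have in common. The following minimal sketch spells those defaults out, assuming a rebuilt copy of the question's frame with fixed scores 0.0 through 7.0 in place of the random ones; wanted and flat are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumption: same index layout as in the question, with deterministic scores.
index = pd.MultiIndex.from_product(
    [['c1', 'c2'], ['d1', 'd2'], ['p1', 'p2']],
    names=['cid', 'doc_id', 'passage_id'],
)
df = pd.DataFrame({'score': np.arange(8, dtype=float)}, index=index)

# The partial keys as a two-column frame.
wanted = pd.DataFrame([('c1', 'd1'), ('c2', 'd2')], columns=['cid', 'doc_id'])

# Flatten the index into columns, keep only rows whose (cid, doc_id) pair
# appears in `wanted`, then rebuild the original three-level index.
flat = df.reset_index()
out = (
    flat.merge(wanted, on=['cid', 'doc_id'], how='inner')
        .set_index(list(df.index.names))
)

print(out)
```

Passing on= and how= explicitly keeps the join from silently changing if wanted later gains a column that df also has.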

Upvotes: 1
