Reputation: 93
I am trying to select rows (including repeats) from a two-level pandas MultiIndex dataframe, using .loc indexing, using a list of labels.
However, if I try this type of indexing with a MultiIndex dataframe, the order of rows of the output is the same as the input, and the repeated indices are ignored. Here is an example:
import numpy as np
import pandas as pd
import string as s
index1 = list(s.ascii_uppercase[:4])
index2 = np.arange(2)
col_names='col1 col2 col3'.split()
new_slices = list('DDAB') # note order and repition of labels
multi_index = pd.MultiIndex.from_product([index1,index2],names=["level0","level1"])
data = np.arange(len(index1)*len(index2)*len(col_names))
data=data.reshape(len(index1)*len(index2),-1)
df2 = pd.DataFrame(data,columns=col_names,index=multi_index)
print(df2.loc[new_slices])
col1 col2 col3
level0 level1
A 0 0 1 2
1 3 4 5
B 0 6 7 8
1 9 10 11
D 0 18 19 20
1 21 22 23
I would instead expect:
col1 col2 col3
level0 level1
D 0 18 19 20
1 21 22 23
D 0 18 19 20
1 21 22 23
A 0 0 1 2
1 3 4 5
B 0 6 7 8
1 9 10 11
Is there a MultiIndex-specific function that I have missed? Or am I misunderstanding how the levels in the MultiIndex work?
(However, this works as I expect when selecting from a 'regular' dataframe, for example:)
import numpy as np
import pandas as pd
import string as s
index1 = list(s.ascii_uppercase[:4])
col_names='col1 col2 col3'.split()
new_slices = list('DDAB') # note order and repition of labels
data1 = np.arange(len(index1)*len(col_names)).reshape(len(index1),-1)
df1 = pd.DataFrame(data1,columns=col_names,index=index1)
print(df1)
print(df1.loc[new_slices])
which gives the result I would expect -- a dataframe with rows D,D,A,B.
Upvotes: 4
Views: 1390
Reputation: 2049
You haven't missed anything. This is a result of the way pandas implements indexing. Partial indexing won't duplicate rows, only complete indexing.
Partial: two of just the first level:
df2.loc[['A', 'A']]
col1 col2 col3
level0 level1
A 0 0 1 2
1 3 4 5
Partial: two of just the second level:
df2.loc[(slice(None), [0, 0]), :]
col1 col2 col3
level0 level1
A 0 0 1 2
B 0 6 7 8
C 0 12 13 14
D 0 18 19 20
However, complete indexing does duplicate rows:
df2.loc[[('A', 0),('A', 0)]]
col1 col2 col3
level0 level1
A 0 0 1 2
0 0 1 2
This is the general case of the specific case that indexing on single indexes duplicates rows. All indexing on single indexes is complete, because there's only one level.
If you want to get duplicate rows through indexing, and not by concating or joining, you can build a complete index list like this:
complete = [(first, second) for first in new_slices for second in set(df2.index.get_level_values(1))]
df2.loc[complete]
col1 col2 col3
level0 level1
D 0 18 19 20
1 21 22 23
0 18 19 20
1 21 22 23
A 0 0 1 2
1 3 4 5
B 0 6 7 8
1 9 10 11
Upvotes: 2
Reputation: 22031
try in this way
index1 = list(s.ascii_uppercase[:4])
index2 = np.arange(2)
col_names='col1 col2 col3'.split()
new_slices = list('DDAB') # note order and repition of labels
multi_index = pd.MultiIndex.from_product([index1,index2],names=["level0","level1"])
data = np.arange(len(index1)*len(index2)*len(col_names))
data=data.reshape(len(index1)*len(index2),-1)
df2 = pd.DataFrame(data,columns=col_names,index=multi_index)
df2.unstack().loc[new_slices].stack() # <=== this does the trick
Upvotes: 2