python pandas groupby after sorted multindex not correct

Question

Dear python/pandas experts

I am having problems with a sort on a index of a multiindex pandas dataframe. More specifically, it seems that the sort works, but that a groupby operation afterwards is 'forgetting' the sort again. For the reference: I am running python 3.4.3 (by anaconda) and pandas 0.16.2 (np19py34_0).

Now in detail what I am trying to do.

I have a multindex dataframe created as

import pandas as pd
label="sdjks"
sidechar="B"
mi_level_fields = (label, sidechar)
mi_level_names = ["Label", "Side"]
pipeinfo_index = pd.MultiIndex.from_tuples([mi_level_fields], names=mi_level_names)

pipeinfoDF = pd.DataFrame(index=pipeinfo_index, columns=[])
pipeinfoDF.ix[(label, sidechar), "Nc"] = 10
pipeinfoDF.ix[(label, "C"), "Nc"] = 10
pipeinfoDF.ix[("ztest", "C"), "Nc"] = 400
pipeinfoDF.ix[("ztest", "B"), "Nc"] = 400
pipeinfoDF.ix[("yaki", "B"), "Nc"] = 1
pipeinfoDF.ix[("yaki", "C"), "Nc"] = 1

This pipeinfoDF data frame looks now like

             Nc
Label Side     
sdjks B      10
      C      10
ztest C     400
      B     400
yaki  B       1
      C       1

Now I want to sort the indices of the dataframe such that the NC columns is in ascending order. This can be done with

pipeinfoDF.sort_index(by=["Nc"], inplace=True, ascending=True)

Which indeed correctly yields using

print(pipeinfoDF.head())

             Nc
Label Side     
yaki  B       1
      C       1
sdjks B      10
      C      10
ztest C     400
      B     400

The problem, however, appears when I want to loop over the rows of this multindex dataframe, which I normally do with

for (label, df) in pipeinfoDF.groupby(level=0, sort=False):
    side_list = df.index.get_level_values('Side')
    for side in side_list:
        data = pipeinfoDF.ix[(label, side)]
        print(label, side, data.Nc)

Which gives now as an output

sdjks B 10.0
sdjks C 10.0
ztest C 400.0
ztest B 400.0
yaki B 1.0
yaki C 1.0

as you an see, although the head() statement shows that the dataframe has been sorted correctly, while looping over the indices (which I normally do to copy the data to another table) does not seem to use the correct sorted indices.

This to me seems to be a bug: the sort option in the groupby statement has no effect on the result, and something similar has been reported here as well.

Now my question is: is there an easy way to fix this ? The head statement seems to be correctly giving my my sorted multindex dataframe, so I have been trying to make a copy of the output of this head like

result = pipeinfoDF.head()

but this not seems to work.

My only last attempt tries to create a new data frame based on the resetted indices:

tmp = pipeinfoDF.copy()
tmp.reset_index(inplace=True)

lbls = tmp.Label.values
sds  = tmp.Side.values

pipeinfo_index2 = pd.MultiIndex.from_tuples(list(zip(lbls,sds)), names=mi_level_names)
pipeinfoDF2 = pd.DataFrame(index=pipeinfo_index2, columns=[])

for index, row in tmp.iterrows():
    for col in tmp.columns[2:]:
        pipeinfoDF2.ix[(row["Label"], row["Side"]), col] = row[col]

Again with head() I get the correct results

             Nc
Label Side     
yaki  B       1
      C       1
sdjks B      10
      C      10
ztest C     400

But with the loop over the multindex frame as earlier it is again sorting on the first group, which I explicitly prevented with sort=False

for (label, df) in pipeinfoDF2.groupby(level=0, sort=False):
    side_list = df.index.get_level_values('Side')
    for side in side_list:
        data = pipeinfoDF2.ix[(label, side)]
        print(label, side, data.Nc)

This give

sdjks B 10.0
sdjks C 10.0
yaki B 1.0
yaki C 1.0
ztest C 400.0
ztest B 400.0

So the groupby option again seems to sort on the first index.

Edit: I found the following to fix this. If you print the index of the dataframe it has labels not in numerical order:

print(pipeinfoDF2.index)

MultiIndex(levels=[['sdjks', 'yaki', 'ztest'], ['B', 'C']],
           labels=[[1, 1, 0, 0, 2, 2], [0, 1, 0, 1, 1, 0]],
           names=['Label', 'Side'])

Here the levels are 'sdjks', 'yaki' 'ztest' and the labels correspond with the order 1,1,0,0,2,2 The same can be seen when plotting the index of the first sorted pipeinfoDF, where the sort_index keeps the order of the levels in MultiIndex, but only changes the order of the labels.

I can therefore fix my problem by forcing the labels to run as 0,0,1,1,2,2, because apparently the groupby ignores the order of the labels and always picks the order of the levels. My fix is therefore

pipeinfo_index2 = pd.MultiIndex.from_tuples([tuples[0]], names=mi_level_names)
pipeinfoDF2 = pd.DataFrame(index=pipeinfo_index2, columns=[])

and then to fill the rest of the fields as I did before. In this way, the multindex looks as

MultiIndex(levels=[['yaki', 'sdjks', 'ztest'], ['B', 'C']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 1, 0]],
           names=['Label', 'Side'])

And now looping over the rows with groupby give the following output

yaki B 1.0
yaki C 1.0
sdjks B 10.0
sdjks C 10.0
ztest C 400.0
ztest B 400.0

Which is correct!

So I have found a very ugly work around: after the sort, copy the whole dataframe to a new one, reset the index, and copy everything back to enforce the MultiIndex labels into numerical order. But I think this is very inefficient and generates lots of code and I am sure it is possible to do this more efficient.

My question therefore is: is there a way to loop over the rows of a multiindex data frame which takes into account the labels order ? It is clearly ignored. Am I missing something? Hopefully there is an easier way to do this.

Any hints appreciated!

Edit:

The suggestion of Firelynx works. If I do

for (label,side) in pipeinfoDF.index:
    data = pipeinfoDF.ix[(label, side)]
    print(label, side, data.Nc)

after the first sort I correctly get the data in the sorted order

yaki B 1.0
yaki C 1.0
sdjks B 10.0
sdjks C 10.0
ztest C 400.0
ztest B 400.0

Saves my a lot of coding. The question remaining nevertheless is: should groupby with the sort=False option on not yield the same result? Is this a bug, or am I miss using the groupby way to loop over my data frame. It is based on examples I googled around, but it should be used with care. Anyway, for now I solver my problem and I am going to abandon my groupby way of accessing data.

Edit:

The solution of Firelynx works, however, it is not taking into account the multilevel structure into account anymore, but just puts all the Label and side levels into one list.

In order to get something very similar as I want to acchieve with the groupby method, I now do the following hack

label_list = []
for (label,side) in pipeinfoDF.index:
    if not label in label_list:
        label_list.append(label)

for label in label_list:
    df = pipeinfoDF.loc[label]
    side_list = df.index.get_level_values('Side')
    for side in side_list:
        data = pipeinfoDF.ix[(label, side)]
        print(label, side, data.Nc)

Which correctly yields

yaki B 1.0
yaki C 1.0
sdjks B 10.0
sdjks C 10.0
ztest C 400.0
ztest B 400.0

So I extract the sorted label list first using the suggestion of Firelynx, and then loop over this list to get the side per label and do something I want to do with it. Although this is much cleaner than my first approach, still I have the feeling it could be done more directly somehow. I can not imaging that you can not use the groupby method on sorted multiindex dataframes without messing up the sort order. Perhaps someone has a suggestion? Anyway, for now I am happy with the solution

Based on the latest suggestion of Firelynx I have a small update which makes it a bit cleaner. Still, you need to keep a list in order to prevent double counts of the label, because unique apply only to a unique (label,side) combination. So I now have

label_list = []
for (label, side) in pipeinfoDF.index.unique():
    if not label in label_list:
        label_list.append(label)
    else:
        continue
    df = pipeinfoDF.loc[label]
    side_list = df.index.get_level_values('Side')
    for side in side_list:
        data = pipeinfoDF.ix[(label, side)]
        print(label, side, data.Nc)

Is it possible to have unique() apply on the label alone? Then I could remove the label_list to keep track of which label has been processed already

firelynx · Accepted Answer

Your for loop goes over .groupby(level=0, ..., you are just grouping on level=0 and therefore your resulting dataset will be sorted only on the first level of your index.

You can probably do:

for label in pipedinfoDF.index.unique():
    group = pipedinfoDF.loc[label]

to get the order you want.

python pandas groupby after sorted multindex not correct

Answers (1)

Related Questions