Cod.ie
Cod.ie

Reputation: 410

How to sort MultiIndex level by number of rows in the child level

I have historical data of quantity and Amount (how much was charged in the transaction) for items sold by a company to many different customers. I am looking to do some time series analysis on this data, but want to do it at item-customer level.

Here is my raw data:

      Year         Month   Day      Qty           Amount     Item   Customer
0     2003         9       1         30.0         220.80     N2719  3110361
1     2003         9       1          1.0          75.17     X1046  3126034
2     2003         9       1        240.0         379.20     D5853  0008933
3     2003         9       1       2112.0        2787.84     D5851  0008933
4     2003         9       1       3312.0        4371.84     D5851  0008933
...
...
<2.7M rows>

This is transactions data sorted by Year/Month/Day and captures which item was sold to what customer and the Quantity and Amount in that sale.

Since I am looking to analyze the time series by Item and Customer, I apply a MultiIndex to it:

df.set_index(['Item', 'Customer', 'Year', 'Month', 'Day'], inplace=True, drop=True)
df.sortlevel(inplace=True)

This gives me a nicely sorted dataframe that looks like:

Item      Customer     Year   Month   Day   Qty      Amount
X1046     3126034      2003   9       1     1.0      75.17
                       < ...  other transactions for X1046/3126034 item/customer combination ...>
          3126035      2005   1       2     50.0     500.00
                        < ...  other transactions for X1046/3126035 item/customer combination ...>
      < ... 48 other customers for X1046 ...>

N2719     3110361      2003    9      1     30.0      220.80   
                       < ...  other transactions for N2719/3110361 item/customer combination ...>
          3110362      2004    9      10     9.0     823.00
                       < ...  other transactions for N2719/3110362 item/customer combination ...>
      < ... 198 other customers for N2719 ... >
< ... 6998 other items ... >

As you can see, since I have 7,000 different items, and each can have several tens or hundreds of customers, I would like to focus on only those items that have a substantial customer base. There are many items in the dataset that were maybe purchased by 1 customer sometime in the past, and may have been discontinued, etc.

So use the following to get items sorted by number of customers:

item_by_customers = df.reset_index().groupby('Item')['Customer'].nunique().sort_values(ascending=False)

Which gives me the items sorted by number of customers as a pandas Series:

Item
N2719    200
X1046     50
<... 6998 other rows ...>

Now I want to apply this sort order to my DataFrame, so data for item N2719 shows up first (retaining all the levels of the MultiIndex within it), followed by X1046, and so on.

I am unable to figure out how to make that happen.

Here's what I've tried so far:

sorted_data = df.set_index(item_by_customers.index)
< ... gives me ValueError: Length mismatch: Expected axis has 2.7M elements, new values have 7000 elements ...>

I can see why I get this error, because I have 7,000 items in the index, and 2.7M rows in the DataFrame.

I also tried reindex:

sorted_data = df.reindex(index=item_by_customers.index, columns=['Item'])
< ... gives me Exception: cannot handle a non-unique multi-index! ...>

There is also a sort_index() which essentially sorts an index column based on its own values, not based on some other criteria.

I am looking for some guidance on how to apply item_by_customers.index to the DataFrame, so I get a DataFrame that looks like this:

Item      Customer     Year   Month   Day   Qty      Amount
N2719     3110361      2003    9      1     30.0      220.80   
                       < ...  other transactions for N2719/3110361 item/customer combination ...>
          3110362      2004    9      10     9.0     823.00
                       < ...  other transactions for N2719/3110362 item/customer combination ...>
      < ... 198 other customers for N2719 ... >

X1046     3126034      2003   9       1     1.0      75.17
                       < ...  other transactions for X1046/3126034 item/customer combination ...>
          3126035      2005   1       2     50.0     500.00
                        < ...  other transactions for X1046/3126035 item/customer combination ...>
      < ... 48 other customers for X1046 ...>

< ... 6998 other items ... >

Upvotes: 2

Views: 197

Answers (2)

WebDev
WebDev

Reputation: 1371

Here is how you can achieve what you are looking for:

import pandas as pd

df = pd.DataFrame({
    'Item':['X1046','X1046','N2719','N2719','N2719'],
    'Customer':['3126034','3126035','3110361','3110362','3110363'],
    'Year':[2003,2005,2003,2004,2004],
    'Month':[9,1,9,9,9],
    'Day':[1,2,1,10,10],
    'Qty':[1,50,30,9,9],
    'Amount':[75.17,500,220,823,823]
})

df.set_index(['Item', 'Customer', 'Year', 'Month', 'Day'], inplace=True, drop=True)
df.sort_index(inplace=True)

item_by_customers = df.reset_index().groupby('Item')['Customer'].nunique().sort_values(ascending=False).rename('Unique_Customers')

df = df.join(item_by_customers, on='Item').sort_values('Unique_Customers', ascending=False)

print(df)

This gives the output as:

                               Qty  Amount  Unique_Customers
Item  Customer Year Month Day
N2719 3110361  2003 9     1     30  220.00                 3
      3110362  2004 9     10     9  823.00                 3
      3110363  2004 9     10     9  823.00                 3
X1046 3126034  2003 9     1      1   75.17                 2
      3126035  2005 1     2     50  500.00                 2

So, the basic strategy is to add the unique counts of customers as a column to the original dataframe and then sort by that as you wish.

Upvotes: 0

piRSquared
piRSquared

Reputation: 294508

transform

df.assign(nu=df.groupby('Item').Customer.transform('nunique')) \
   .sort_values(['nu', 'Item'], ascending=[False, True])

Upvotes: 1

Related Questions