Reputation: 410
I have historical data of quantity and Amount (how much was charged in the transaction) for items sold by a company to many different customers. I am looking to do some time series analysis on this data, but want to do it at item-customer level.
Here is my raw data:
Year Month Day Qty Amount Item Customer
0 2003 9 1 30.0 220.80 N2719 3110361
1 2003 9 1 1.0 75.17 X1046 3126034
2 2003 9 1 240.0 379.20 D5853 0008933
3 2003 9 1 2112.0 2787.84 D5851 0008933
4 2003 9 1 3312.0 4371.84 D5851 0008933
...
...
<2.7M rows>
This is transactions data sorted by Year/Month/Day and captures which item was sold to what customer and the Quantity and Amount in that sale.
Since I am looking to analyze the time series by Item and Customer, I apply a MultiIndex to it:
df.set_index(['Item', 'Customer', 'Year', 'Month', 'Day'], inplace=True, drop=True)
df.sortlevel(inplace=True)
This gives me a nicely sorted dataframe that looks like:
Item Customer Year Month Day Qty Amount
X1046 3126034 2003 9 1 1.0 75.17
< ... other transactions for X1046/3126034 item/customer combination ...>
3126035 2005 1 2 50.0 500.00
< ... other transactions for X1046/3126035 item/customer combination ...>
< ... 48 other customers for X1046 ...>
N2719 3110361 2003 9 1 30.0 220.80
< ... other transactions for N2719/3110361 item/customer combination ...>
3110362 2004 9 10 9.0 823.00
< ... other transactions for N2719/3110362 item/customer combination ...>
< ... 198 other customers for N2719 ... >
< ... 6998 other items ... >
As you can see, since I have 7,000 different items, and each can have several tens or hundreds of customers, I would like to focus on only those items that have a substantial customer base. There are many items in the dataset that were maybe purchased by 1 customer sometime in the past, and may have been discontinued, etc.
So use the following to get items sorted by number of customers:
item_by_customers = df.reset_index().groupby('Item')['Customer'].nunique().sort_values(ascending=False)
Which gives me the items sorted by number of customers as a pandas Series:
Item
N2719 200
X1046 50
<... 6998 other rows ...>
Now I want to apply this sort order to my DataFrame, so data for item N2719 shows up first (retaining all the levels of the MultiIndex within it), followed by X1046, and so on.
I am unable to figure out how to make that happen.
Here's what I've tried so far:
sorted_data = df.set_index(item_by_customers.index)
< ... gives me ValueError: Length mismatch: Expected axis has 2.7M elements, new values have 7000 elements ...>
I can see why I get this error, because I have 7,000 items in the index, and 2.7M rows in the DataFrame.
I also tried reindex:
sorted_data = df.reindex(index=item_by_customers.index, columns=['Item'])
< ... gives me Exception: cannot handle a non-unique multi-index! ...>
There is also a sort_index()
which essentially sorts an index column based on its own values, not based on some other criteria.
I am looking for some guidance on how to apply item_by_customers.index
to the DataFrame, so I get a DataFrame that looks like this:
Item Customer Year Month Day Qty Amount
N2719 3110361 2003 9 1 30.0 220.80
< ... other transactions for N2719/3110361 item/customer combination ...>
3110362 2004 9 10 9.0 823.00
< ... other transactions for N2719/3110362 item/customer combination ...>
< ... 198 other customers for N2719 ... >
X1046 3126034 2003 9 1 1.0 75.17
< ... other transactions for X1046/3126034 item/customer combination ...>
3126035 2005 1 2 50.0 500.00
< ... other transactions for X1046/3126035 item/customer combination ...>
< ... 48 other customers for X1046 ...>
< ... 6998 other items ... >
Upvotes: 2
Views: 197
Reputation: 1371
Here is how you can achieve what you are looking for:
import pandas as pd
df = pd.DataFrame({
'Item':['X1046','X1046','N2719','N2719','N2719'],
'Customer':['3126034','3126035','3110361','3110362','3110363'],
'Year':[2003,2005,2003,2004,2004],
'Month':[9,1,9,9,9],
'Day':[1,2,1,10,10],
'Qty':[1,50,30,9,9],
'Amount':[75.17,500,220,823,823]
})
df.set_index(['Item', 'Customer', 'Year', 'Month', 'Day'], inplace=True, drop=True)
df.sort_index(inplace=True)
item_by_customers = df.reset_index().groupby('Item')['Customer'].nunique().sort_values(ascending=False).rename('Unique_Customers')
df = df.join(item_by_customers, on='Item').sort_values('Unique_Customers', ascending=False)
print(df)
This gives the output as:
Qty Amount Unique_Customers
Item Customer Year Month Day
N2719 3110361 2003 9 1 30 220.00 3
3110362 2004 9 10 9 823.00 3
3110363 2004 9 10 9 823.00 3
X1046 3126034 2003 9 1 1 75.17 2
3126035 2005 1 2 50 500.00 2
So, the basic strategy is to add the unique counts of customers as a column to the original dataframe and then sort by that as you wish.
Upvotes: 0
Reputation: 294508
transform
df.assign(nu=df.groupby('Item').Customer.transform('nunique')) \
.sort_values(['nu', 'Item'], ascending=[False, True])
Upvotes: 1