Lukasz Siemiradzki
Lukasz Siemiradzki

Reputation: 336

How I can dynamically define bins for grouped data using pandas?

Having a dataset, I have to group it in several ways (with MultiIndex), perform some aggregations and export results. One of these operations is binning (bucketing) on column with prices to obtain amount of elements in each bucket. I need 3 buckets where:

Example:

Product Country Sell col price colb
First DE A b 100 x
Second DE A g 105 z
First FR A b 111 x
Second FR A g 100 z
First DE B b 109 x
Second DE B g 120 z
First FR B b 100 x
Second FR B g 200 z

What I expect:

Product Country Sell 1x 1.1x >1.5x
First DE A 1 0 0
B 0 1 0
FR A 0 1 0
B 1 0 0
Second DE A 1 0 0
B 0 1 0
FR A 1 0 0
B 0 0 1

Now, what I'm trying to do is following:

import numpy as np
import pandas as pd

# some code

df_low_price = df.groupby(["Product","Country","Sell"])["price"].sort_values(by="price").nth(0)
df_low_price_1_1x = df_low_price.map(lambda n: n * 1.1)
df_low_price_1_5x = df_low_price.map(lambda n: n * 1.5)

df_main = pd.concat([
df_low_price,
df_low_price_1_1x,
df_low_price_1_5x, axis=1
])

I'm getting the boundaries, but I don't get the size of the groups. I know that I should rely on pd.cut or cut, but I have no idea how to do it pythonic/pandas way. Thank you in advance for any suggestions.

Upvotes: 0

Views: 938

Answers (1)

Rob Raymond
Rob Raymond

Reputation: 31156

  • just followed what you describe
  • to bin cut
  • multi-index and columns groupby/agg and unstack()
df = pd.DataFrame({'Product': ['First',
  'Second',
  'First',
  'Second',
  'First',
  'Second',
  'First',
  'Second'],
 'Country': ['DE', 'DE', 'FR', 'FR', 'DE', 'DE', 'FR', 'FR'],
 'Sell': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
 'col': ['b', 'g', 'b', 'g', 'b', 'g', 'b', 'g'],
 'price': [100, 105, 111, 100, 109, 120, 100, 200],
 'colb': ['x', 'z', 'x', 'z', 'x', 'z', 'x', 'z'],
 'bin': ['1x', '1x', '1.1x', '1x', '1x', '1.1x', '1x', '>1.5x']})

df["bin"] = pd.cut(
    df["price"],
    bins=[
        df["price"].min() - 1,
        df["price"].min() * 1.1,
        df["price"].min() * 1.5,
        df["price"].max(),
    ],
    labels=["1x", "1.1x", ">1.5x"],
)

df.groupby(["Product","Country","Sell","bin"]).agg({"col":"count"}).unstack().droplevel(0,1)

output

bin                   1x  1.1x  >1.5x
Product Country Sell                 
First   DE      A      1     0      0
                B      1     0      0
        FR      A      0     1      0
                B      1     0      0
Second  DE      A      1     0      0
                B      0     1      0
        FR      A      1     0      0
                B      0     0      1

Upvotes: 1

Related Questions