Elias Kassell

Reputation: 180

What's the most efficient way of binning and reduction in python?

I need an efficient way of first binning an array into different groups, then reducing the binned values into the mean of each category.

I suspect numpy and pandas are the best modules to use, so I've implemented a naive method, but I can't find a more efficient approach that uses numpy's vectorised operations for every step.

Example

import numpy as np
import pandas as pd

# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(100),
                   "to_sum": np.random.random(100)})

# Create some bins and assign each value to the right bin.
bins = np.linspace(0, 1, 11)
df["bin"] = pd.cut(df["to_query"], bins)

# Sorting by values in bin probably speeds up assignment.
df = df.sort_values(by="bin")

# Different bin categories.
unique_cats = np.unique(df["bin"])

# Here is where np and pd start to be limited.
cats = {cat: [] for cat in unique_cats}
for index, row in df.iterrows():
    cats[row["bin"]].append(row["to_query"])
cats = {cat: np.mean(vals) for (cat, vals) in cats.items()}

Output

{Interval(0.0, 0.1, closed='right'): 0.044511160008147525,
 Interval(0.1, 0.2, closed='right'): 0.15604870269444393,
 Interval(0.2, 0.3, closed='right'): 0.2622453825250092,
 Interval(0.3, 0.4, closed='right'): 0.35781166427200817,
 Interval(0.4, 0.5, closed='right'): 0.4520571189836645,
 Interval(0.5, 0.6, closed='right'): 0.5516216469609114,
 Interval(0.6, 0.7, closed='right'): 0.6487768161741378,
 Interval(0.7, 0.8, closed='right'): 0.7640266198159613,
 Interval(0.8, 0.9, closed='right'): 0.8468250462224601,
 Interval(0.9, 1.0, closed='right'): 0.9633520862829202}
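For reference, the same per-bin means can be computed with no Python-level loop at all using plain numpy. This is a sketch, not the method above: it uses `np.digitize` plus `np.bincount`, and note that `np.digitize`'s default intervals are closed on the left, unlike `pd.cut`'s right-closed intervals.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(100)

# Map each value to a bin index; with these edges, indices run 1..10.
edges = np.linspace(0, 1, 11)
idx = np.digitize(values, edges)

# One pass for the per-bin sums, one for the counts, then divide.
sums = np.bincount(idx, weights=values, minlength=len(edges) + 1)
counts = np.bincount(idx, minlength=len(edges) + 1)
with np.errstate(invalid="ignore"):  # empty bins produce NaN, not an error
    means = sums / counts

# means[i] is the mean of values falling in [edges[i-1], edges[i]).
```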

Upvotes: 0

Views: 1717

Answers (2)

Elias Kassell

Reputation: 180

Best Method - Time Comparison

In order from fastest to slowest, using

import numpy as np
import pandas as pd

# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(1 * 10 ** 4),
                   "to_sum": np.random.random(1 * 10 ** 4)})
df["bin"] = pd.cut(df["to_query"], np.linspace(0, 1, 11))
df = df.sort_values(by="bin")

as the shared setup.

@Code Different

%%time
df = df.groupby('bin').mean()

Wall time: 7.22 ms

@CJR

%%time
def make_bins(series, bins):
    min_v, max_v = np.min(series), np.max(series)
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (min_v - max_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)

# The "bins" column must be created before grouping on it.
df["bins"] = make_bins(df["to_query"], 11)
df.loc[:, ["bins", "to_sum"]].groupby("bins").agg('mean')

Wall time: 11.9 ms

Original method

%%time
# Different bin categories.
unique_cats = np.unique(df["bin"])

# Here is where np and pd start to be limited.
cats = {cat: [] for cat in unique_cats}
for index, row in df.iterrows():
    cats[row["bin"]].append(row["to_query"])
cats = {cat: np.mean(vals) for (cat, vals) in cats.items()}

Wall time: 2.2 s

Significant time improvement, thanks, and much more concise!

Upvotes: 1

CJR

Reputation: 3985

This is how I would do it:

import numpy as np
import pandas as pd

# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(100),
                   "to_sum": np.random.random(100)})

Then:

def make_bins(series, bins):
    min_v, max_v = np.min(series), np.max(series)
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (min_v - max_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)

df["bins"] = make_bins(df["to_query"], 11)

df.groupby("bins").agg('mean')

Results in:

      to_query    to_sum
bins                    
0.0   0.047117  0.554289
1.0   0.161922  0.521029
2.0   0.226992  0.465175
3.0   0.327877  0.592192
4.0   0.420162  0.359697
5.0   0.504586  0.547049
6.0   0.585511  0.350083
7.0   0.685560  0.677394
8.0   0.772207  0.606797
9.0   0.866236  0.516578
10.0  0.946512  0.547876
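Note that `make_bins` labels groups with a plain float index rather than `pd.cut`'s `Interval` objects, and its `bins` argument is the number of groups: `make_bins(series, 11)` above yields eleven groups (indices 0..10), whereas `pd.cut(series, np.linspace(0, 1, 11))` in the question yields ten. A small sketch illustrating this, reusing `make_bins` verbatim (the seed is only for reproducibility):

```python
import numpy as np
import pandas as pd

def make_bins(series, bins):
    # Same function as in the answer: map each value to an integer bin index.
    min_v, max_v = np.min(series), np.max(series)
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (min_v - max_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)

rng = np.random.default_rng(1)
s = pd.Series(rng.random(1000))

# bins=10 matches the ten intervals of pd.cut(s, np.linspace(0, 1, 11));
# bins=11 (as used above) would create an eleventh group.
idx = make_bins(s, 10)
```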

Upvotes: 1
