Reputation: 180
I need an efficient way of first binning an array into different groups, then reducing the binned values to the mean of each category.
I suspect numpy and pandas are the best modules to use, so I've implemented a naive method, but I can't find a more efficient approach that utilises numpy's fast vectorised operations for every step.
import numpy as np
import pandas as pd
# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(100),
"to_sum": np.random.random(100)})
# Create some bins and assign each value to the right bin.
bins = np.linspace(0, 1, 11)
df["bin"] = pd.cut(df["to_query"], bins)
# Sorting by values in bin probably speeds up assignment.
df = df.sort_values(by="bin")
# Different bin categories.
unique_cats = np.unique(df["bin"])
# Here is where np and pd start to be limited.
cats = {cat: [] for cat in unique_cats}
for index, row in df.iterrows():
    cats[row["bin"]].append(row["to_query"])
cats = {cat: np.mean(vals) for (cat, vals) in cats.items()}
{Interval(0.0, 0.1, closed='right'): 0.044511160008147525,
 Interval(0.1, 0.2, closed='right'): 0.15604870269444393,
 Interval(0.2, 0.3, closed='right'): 0.2622453825250092,
 Interval(0.3, 0.4, closed='right'): 0.35781166427200817,
 Interval(0.4, 0.5, closed='right'): 0.4520571189836645,
 Interval(0.5, 0.6, closed='right'): 0.5516216469609114,
 Interval(0.6, 0.7, closed='right'): 0.6487768161741378,
 Interval(0.7, 0.8, closed='right'): 0.7640266198159613,
 Interval(0.8, 0.9, closed='right'): 0.8468250462224601,
 Interval(0.9, 1.0, closed='right'): 0.9633520862829202}
Upvotes: 0
Views: 1717
Reputation: 180
Timings for the approaches, fastest first, using
import numpy as np
import pandas as pd
# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(1 * 10 ** 4), "to_sum": np.random.random(1 * 10 ** 4)})
df["bin"] = pd.cut(df["to_query"], np.linspace(0, 1, 11))
df = df.sort_values(by="bin")
as prerequisite.
%%time
df.groupby('bin').mean()
Wall time: 7.22 ms
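A side note, not timed in the original: if only the mean of one column is needed, selecting it before aggregating avoids computing the others, e.g.
df.groupby('bin')['to_sum'].mean()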
%%time
def make_bins(series, bins):
    min_v, max_v = np.min(series), np.max(series)
    # Relative epsilon so the maximum value still lands in the last bin.
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (max_v - min_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)

df["bins"] = make_bins(df["to_query"], 11)
df.loc[:, ["bins", "to_sum"]].groupby("bins").agg('mean')
Wall time: 11.9 ms
%%time
# Different bin categories.
unique_cats = np.unique(df["bin"])
# Here is where np and pd start to be limited.
cats = {cat: [] for cat in unique_cats}
for index, row in df.iterrows():
    cats[row["bin"]].append(row["to_query"])
cats = {cat: np.mean(vals) for (cat, vals) in cats.items()}
Wall time: 2.2 s
Significant time improvement, thanks, and much more concise!
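One more side note that isn't part of the timings above: %%time measures a single run, so the numbers can be noisy. IPython's %timeit repeats the statement and reports an average, which gives more stable comparisons, e.g.
%timeit df.groupby('bin').mean()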
Upvotes: 1
Reputation: 3985
This is how I would do it:
import numpy as np
import pandas as pd
# Create a fake set of data.
df = pd.DataFrame({"to_query": np.random.random(100),
"to_sum": np.random.random(100)})
Then:
def make_bins(series, bins):
    min_v, max_v = np.min(series), np.max(series)
    # Relative epsilon so the maximum value still lands in the last bin.
    epsilon = max(np.finfo(float).eps, np.finfo(float).eps * (max_v - min_v))
    return np.floor((series - min_v) / (max_v - min_v + epsilon) * bins)
df["bins"] = make_bins(df["to_query"], 11)
df.groupby("bins").agg('mean')
Results in:
      to_query    to_sum
bins
0.0   0.047117  0.554289
1.0   0.161922  0.521029
2.0   0.226992  0.465175
3.0   0.327877  0.592192
4.0   0.420162  0.359697
5.0   0.504586  0.547049
6.0   0.585511  0.350083
7.0   0.685560  0.677394
8.0   0.772207  0.606797
9.0   0.866236  0.516578
10.0  0.946512  0.547876
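If you want to stay entirely in numpy, a similar bin-then-mean can be done with np.digitize and np.bincount. The sketch below uses stand-in arrays rather than the DataFrame above, and its bins are left-closed, so edge handling differs slightly from pd.cut's right-closed intervals:
import numpy as np

values = np.random.random(100)    # stand-in for df["to_query"]
to_sum = np.random.random(100)    # stand-in for df["to_sum"]
edges = np.linspace(0, 1, 11)

# Index of the bin each value falls into (0 .. len(edges) - 2).
idx = np.digitize(values, edges[1:-1])

# Per-bin sums and counts in a single pass each, then the mean.
sums = np.bincount(idx, weights=to_sum, minlength=len(edges) - 1)
counts = np.bincount(idx, minlength=len(edges) - 1)
means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)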
Upvotes: 1