Reputation: 139
From the data frame, I am trying to use the 'mean' column to separate the values into 3 bins.
                     num_countries         mean
0  'Europe',                    25   161.572326
1  'Asia',                       7   607.983830
2  'North America',              3  1560.438095
3  'South America',              2   199.148901
4  'Australia',                  1   218.021429
5  'Africa',                     1   213.846154
6  'Oceania',                    1    39.378571
My bins are:
bins = [-np.inf, (in_order['mean'].mean()-in_order['mean'].std()), (in_order['mean'].mean()+in_order['mean'].std()), np.inf]
which results in [-inf, -100.38831237389581, 957.64239998696303, inf]
Then when I try to put them into the bins, this is what happens.
binned = pd.cut(in_order.mean, bins)
TypeError Traceback (most recent call last)
<ipython-input-229-3343eeaf99d6> in <module>()
----> 1 binned = pd.cut(in_order.mean, bins)
C:\Users\zkrumlinde\AppData\Local\Enthought\Canopy32\edm\envs\User\lib\site-packages\pandas\tools\tile.pyc in cut(x, bins, right, labels, retbins, precision, include_lowest)
117 return _bins_to_cuts(x, bins, right=right, labels=labels,
118 retbins=retbins, precision=precision,
--> 119 include_lowest=include_lowest)
120
121
C:\Users\zkrumlinde\AppData\Local\Enthought\Canopy32\edm\envs\User\lib\site-packages\pandas\tools\tile.pyc in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
222
223 levels = np.asarray(levels, dtype=object)
--> 224 np.putmask(ids, na_mask, 0)
225 fac = Categorical(ids - 1, levels, ordered=True, fastpath=True)
226 else:
TypeError: putmask() argument 1 must be numpy.ndarray, not numpy.int32
Upvotes: 1
Views: 450
Reputation: 294586
I'd use np.searchsorted
x = in_order['mean'].values
sig = x.std()
mu = x.mean()
in_order.assign(bins=np.searchsorted([mu - sig, mu + sig], x))
continent num_countries mean bins
0 Europe 25 161.572326 1
1 Asia 7 607.983830 1
2 North America 3 1560.438095 2
3 South America 2 199.148901 1
4 Australia 1 218.021429 1
5 Africa 1 213.846154 1
6 Oceania 1 39.378571 1
We can do that with labels if you'd like
x = in_order['mean'].values
sig = x.std()
mu = x.mean()
labels = np.array(['< μ - σ', 'μ ± σ', '> μ + σ'])
in_order.assign(bins=labels[np.searchsorted([mu - sig, mu + sig], x)])
continent num_countries mean bins
0 Europe 25 161.572326 μ ± σ
1 Asia 7 607.983830 μ ± σ
2 North America 3 1560.438095 > μ + σ
3 South America 2 199.148901 μ ± σ
4 Australia 1 218.021429 μ ± σ
5 Africa 1 213.846154 μ ± σ
6 Oceania 1 39.378571 μ ± σ
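One detail worth checking with this approach: NumPy's std() defaults to ddof=0 (population standard deviation), while pandas' Series.std() defaults to ddof=1 (sample standard deviation), so the edges computed from x.std() are slightly narrower than the ones in the question. The following is a minimal sketch, assuming the frame is rebuilt from the values printed in the question; passing ddof=1 is my adjustment, not part of the answer above, and it reproduces the question's edges.

```python
import numpy as np
import pandas as pd

# Data copied from the question; the frame name in_order follows the original post.
in_order = pd.DataFrame({
    "continent": ["Europe", "Asia", "North America", "South America",
                  "Australia", "Africa", "Oceania"],
    "num_countries": [25, 7, 3, 2, 1, 1, 1],
    "mean": [161.572326, 607.983830, 1560.438095, 199.148901,
             218.021429, 213.846154, 39.378571],
})

x = in_order["mean"].to_numpy()
mu = x.mean()

sig_population = x.std()     # NumPy default: ddof=0
sig_sample = x.std(ddof=1)   # same convention as pandas' Series.std()

# Two inner edges give three bins: below, inside, and above mu +/- sigma.
edges = [mu - sig_sample, mu + sig_sample]

# searchsorted returns, for each value, the index of the bin it falls into.
codes = np.searchsorted(edges, x)

result = in_order.assign(bins=codes)
print(result)
```

For this data both conventions put every row in the same bin, but with values closer to an edge the choice of ddof could change the result.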
Upvotes: 1
Reputation: 403278
Starting with your data:
print(df)
continent num_countries mean
0 Europe 25 161.572326
1 Asia 7 607.983830
2 North America 3 1560.438095
3 South America 2 199.148901
4 Australia 1 218.021429
5 Africa 1 213.846154
6 Oceania 1 39.378571
I believe the main problem is how you reference the 'mean' column. Note that mean is also a method on pd.DataFrame objects, so attribute access such as in_order.mean returns that bound method rather than the column. Observe:
print(df.mean)
<bound method DataFrame.mean of ....>
If you want to access the 'mean' column (and not the mean method), you'll need to do so with df['mean'].
s = pd.cut(in_order['mean'], bins)
print(s)
0 (-100.388, 957.642]
1 (-100.388, 957.642]
2 (957.642, inf]
3 (-100.388, 957.642]
4 (-100.388, 957.642]
5 (-100.388, 957.642]
6 (-100.388, 957.642]
Name: mean, dtype: category
Categories (3, interval[float64]): [(-inf, -100.388] < (-100.388, 957.642] < (957.642, inf]]
print(s.cat.codes)
0 1
1 1
2 2
3 1
4 1
5 1
6 1
dtype: int8
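For anyone who wants to reproduce this end to end, here is a small self-contained sketch. It assumes the frame is rebuilt from the values printed in the question, and the label names "low", "middle" and "high" are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "continent": ["Europe", "Asia", "North America", "South America",
                  "Australia", "Africa", "Oceania"],
    "num_countries": [25, 7, 3, 2, 1, 1, 1],
    "mean": [161.572326, 607.983830, 1560.438095, 199.148901,
             218.021429, 213.846154, 39.378571],
})

# Attribute access resolves to the DataFrame.mean method, not the column.
print(callable(df.mean))

# Bracket access always returns the column as a Series.
col = df["mean"]
print(type(col).__name__)

# Same bin edges as in the question: mean minus/plus one standard deviation.
bins = [-np.inf, col.mean() - col.std(), col.mean() + col.std(), np.inf]

# Hypothetical label names; omit labels= to get interval categories instead.
binned = pd.cut(col, bins, labels=["low", "middle", "high"])
print(binned)
```

The same rule applies to any column whose name collides with a DataFrame attribute or method (count, sum, index and so on), so bracket access is the safer habit.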
Alternatively, have you considered pd.qcut? You can simply pass the number of bins, and your data will be binned into that many quantiles.
s = pd.qcut(df['mean'], 4)
print(s)
0 (39.378, 180.361]
1 (413.003, 1560.438]
2 (413.003, 1560.438]
3 (180.361, 213.846]
4 (213.846, 413.003]
5 (180.361, 213.846]
6 (39.378, 180.361]
Name: mean, dtype: category
Categories (4, interval[float64]): [(39.378, 180.361] < (180.361, 213.846] < (213.846, 413.003] <
(413.003, 1560.438]]
print(s.cat.codes)
0 0
1 3
2 3
3 1
4 2
5 1
6 0
dtype: int8
Your method above puts most of your data into a single category, so I believe this should work better for you.
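If you only need the integer codes, a minimal sketch (assuming the same values as printed in the question) is to pass labels=False, which makes pd.qcut return the bin numbers directly instead of interval categories:

```python
import pandas as pd

# The 'mean' column from the question, as a standalone Series.
mean = pd.Series(
    [161.572326, 607.983830, 1560.438095, 199.148901,
     218.021429, 213.846154, 39.378571],
    name="mean",
)

# labels=False returns integer bin codes rather than a categorical.
quartile = pd.qcut(mean, 4, labels=False)
print(quartile)

# Quantile bins spread the rows far more evenly than the mean +/- std bins.
counts = quartile.value_counts().sort_index()
print(counts)
```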
Upvotes: 2