Reputation: 11474
I have this problem that I cannot get around. I have this dataframe:
item distance
0 1 0
1 2 1
2 3 1
3 4 3
4 5 4
5 6 4
6 7 5
7 8 6
8 9 7
9 10 7
10 11 7
11 12 7
12 13 8
13 14 8
14 15 20
15 16 20
and I need to associate each row to an interval. So, I thought about creating "bins" this way:
max_distance = df['distance'].max()
min_distance = df['distance'].min()
number_bins = (round(max_distance)-round(min_distance))/0.5
This means that each interval has length 0.5, which creates 40 bins. But this is where I get stuck. I do not know how to:
1. create the intervals (0,0.5], (0.5,1], (1,1.5], (1.5,2], (2,2.5], (2.5,3], ... and give each of them a name 1, 2, 3, ..., 40
2. associate each value of df['distance'] to a specific interval number (from 1.)
item distance bin
0 1 0 1
1 2 1 2
2 3 1 2
3 4 3 6
4 5 4 6
5 6 4 #and so on
6 7 5
7 8 6
8 9 7
9 10 7
10 11 7
11 12 7
12 13 8
13 14 8
14 15 20
15 16 20
Now, I tried something using pd.cut, but doing so:
bins_df = pd.cut(df['distance'], round(number_bins))
bins_unique = bins_df.unique()
returns intervals with gaps and not enough categories:
[(-0.02, 0.155], (0.93, 1.085], (2.946, 3.101], (3.876, 4.031], (4.961, 5.116], (5.891, 6.047], (6.977, 7.132], (7.907, 8.062], (19.845, 20.0]]
Categories (9, interval[float64]): [(-0.02, 0.155] < (0.93, 1.085] < (2.946, 3.101] < (3.876, 4.031] ... (5.891, 6.047] < (6.977, 7.132] < (7.907, 8.062] < (19.845, 20.0]]
Ideally, I would associate every distance value with a category in [1, number_bins].
Any idea on how I could achieve my desired output would be greatly appreciated.
Upvotes: 0
Views: 407
Reputation: 11395
You seemed on the right track with the 2 steps you specified. Here’s how I would carry them out:
import numpy as np
import pandas as pd
bounds = np.arange(df['distance'].min(), df['distance'].max() + .5, .5)
bounds
np.arange is mostly like range(), but you can specify floating-point bounds and step. The + .5 ensures you get the final bound. This gives you the following:
array([ 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5, 10. , 10.5, 11. , 11.5,
12. , 12.5, 13. , 13.5, 14. , 14.5, 15. , 15.5, 16. ])
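As a rough check of the same call against the distance values listed in the question (0 to 20), here is a minimal sketch. The dataframe is retyped by hand from the question and the names step and n_bins are only illustrative.

```python
import numpy as np
import pandas as pd

# Distances retyped from the question's dataframe (assumption: copied correctly).
df = pd.DataFrame({
    "item": range(1, 17),
    "distance": [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20],
})

step = 0.5
lo, hi = df["distance"].min(), df["distance"].max()

# np.arange excludes the stop value, so one extra step keeps the last edge.
bounds = np.arange(lo, hi + step, step)

# N edges delimit N - 1 bins.
n_bins = len(bounds) - 1
print(bounds[:3], bounds[-1], n_bins)
```

With these values the edges run from 0 to 20, which gives the 40 bins the question asks for. For steps that are not exactly representable in binary (0.1 for instance), np.linspace with an explicit number of points is usually the safer way to build the edges.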
dist_bins = pd.cut(df['distance'], bins=bounds, include_lowest=True)
dist_bins
This uses the fact that you can specify the bins manually, see the doc:
bins : int, sequence of scalars, or IntervalIndex
The criteria to bin by.
- [...]
- sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
Which returns:
0 (0.999, 1.5]
1 (1.5, 2.0]
2 (2.5, 3.0]
3 (3.5, 4.0]
4 (4.5, 5.0]
5 (5.5, 6.0]
6 (6.5, 7.0]
7 (7.5, 8.0]
8 (8.5, 9.0]
9 (9.5, 10.0]
10 (10.5, 11.0]
11 (11.5, 12.0]
12 (12.5, 13.0]
13 (13.5, 14.0]
14 (14.5, 15.0]
15 (15.5, 16.0]
Name: distance, dtype: category
Categories (30, interval[float64]): [(0.999, 1.5] < (1.5, 2.0] < (2.0, 2.5] < (2.5, 3.0] < ... <
(14.0, 14.5] < (14.5, 15.0] < (15.0, 15.5] < (15.5, 16.0]]
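To illustrate the "No extension of the range of x is done" part of the doc quoted above, here is a small sketch that contrasts an integer bins argument with an explicit sequence of edges. It assumes the distance values from the question; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

distance = pd.Series(
    [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20], name="distance"
)

# Integer `bins`: pandas computes 40 equal-width bins itself and widens the
# range slightly on the left so that the minimum value is included.
auto_cut = pd.cut(distance, bins=40)

# Sequence `bins`: the edges are used as given, nothing is widened.
edges = np.arange(0, 20.5, 0.5)
manual_cut = pd.cut(distance, bins=edges, include_lowest=True)

auto_first = auto_cut.cat.categories[0]
manual_last = manual_cut.cat.categories[-1]
print(auto_first, manual_last)
```

The first automatic bin starts slightly below 0, which is where an edge like -0.02 in the question's output comes from, while the manual edges stop exactly at 20.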
Note that as per your specification of bins the distance 1 would not fall in any bin, which is why I used include_lowest=True and why the first bin looks like (0.999, 1.5] (which is basically [1, 1.5]). If you don’t want this you need to start bins below your min().
You get a (sorted) category dtype column (pd.Series) as expected.
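The effect of include_lowest can be checked directly. The sketch below assumes the distance values from the question (minimum 0) and compares three variants: the default, include_lowest=True, and edges that start one step below the minimum.

```python
import numpy as np
import pandas as pd

distance = pd.Series(
    [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20], name="distance"
)
bounds = np.arange(distance.min(), distance.max() + 0.5, 0.5)

# Default: bins are closed on the right only, so the minimum (0) sits on the
# open left edge of the first bin and is left unbinned (NaN).
without = pd.cut(distance, bins=bounds)

# include_lowest=True also closes the first bin on the left.
with_lowest = pd.cut(distance, bins=bounds, include_lowest=True)

# Alternative: start the edges one step below the minimum instead.
shifted_bounds = np.arange(distance.min() - 0.5, distance.max() + 0.5, 0.5)
shifted = pd.cut(distance, bins=shifted_bounds)

print(without.isna().sum(), with_lowest.isna().sum(), shifted.isna().sum())
```

Shifting the edges keeps every bin half-open, at the cost of one extra bin at the bottom.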
If you want the list of the 30 categories that were created, you can access them with the .cat accessor:
dist_bins.cat.categories
This returns an IntervalIndex:
IntervalIndex([(0.999, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5] ... (13.5, 14.0], (14.0, 14.5], (14.5, 15.0], (15.0, 15.5], (15.5, 16.0]],
closed='right',
dtype='interval[float64]')
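This also seems to explain the "gaps and not enough categories" seen in the question: .unique() only returns the bins that actually contain a value, whereas .cat.categories lists every bin. A minimal sketch, assuming the distance values from the question:

```python
import numpy as np
import pandas as pd

distance = pd.Series(
    [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20], name="distance"
)
bounds = np.arange(distance.min(), distance.max() + 0.5, 0.5)
dist_bins = pd.cut(distance, bins=bounds, include_lowest=True)

# Only the bins that contain at least one distance.
observed = dist_bins.unique()

# Every bin that was defined, whether or not any value fell into it.
all_bins = dist_bins.cat.categories

print(len(observed), len(all_bins))
```

Here only 9 of the 40 bins are occupied, one per distinct distance value.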
As with every index you can access the list of values:
>>> dist_bins.cat.categories.to_list()
[Interval(0.999, 1.5, closed='right'), Interval(1.5, 2.0, closed='right'), Interval(2.0, 2.5, closed='right'), Interval(2.5, 3.0, closed='right'), Interval(3.0, 3.5, closed='right'), Interval(3.5, 4.0, closed='right'), Interval(4.0, 4.5, closed='right'), Interval(4.5, 5.0, closed='right'), Interval(5.0, 5.5, closed='right'), Interval(5.5, 6.0, closed='right'), Interval(6.0, 6.5, closed='right'), Interval(6.5, 7.0, closed='right'), Interval(7.0, 7.5, closed='right'), Interval(7.5, 8.0, closed='right'), Interval(8.0, 8.5, closed='right'), Interval(8.5, 9.0, closed='right'), Interval(9.0, 9.5, closed='right'), Interval(9.5, 10.0, closed='right'), Interval(10.0, 10.5, closed='right'), Interval(10.5, 11.0, closed='right'), Interval(11.0, 11.5, closed='right'), Interval(11.5, 12.0, closed='right'), Interval(12.0, 12.5, closed='right'), Interval(12.5, 13.0, closed='right'), Interval(13.0, 13.5, closed='right'), Interval(13.5, 14.0, closed='right'), Interval(14.0, 14.5, closed='right'), Interval(14.5, 15.0, closed='right'), Interval(15.0, 15.5, closed='right'), Interval(15.5, 16.0, closed='right')]
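If what you ultimately want is the bin number from 1 to number_bins rather than the interval itself, one option is labels=False, which makes pd.cut return the zero-based position of each bin. This is a sketch under the assumption that the data is the dataframe from the question; the column name bin follows the desired output shown there.

```python
import numpy as np
import pandas as pd

# Dataframe retyped from the question (assumption: copied correctly).
df = pd.DataFrame({
    "item": range(1, 17),
    "distance": [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20],
})
bounds = np.arange(df["distance"].min(), df["distance"].max() + 0.5, 0.5)

# labels=False returns integer bin positions starting at 0; adding 1 gives
# the 1-based numbering asked for in the question.
df["bin"] = pd.cut(
    df["distance"], bins=bounds, include_lowest=True, labels=False
) + 1

print(df)
```

With these edges a distance of 4 lands in bin 8, that is (3.5, 4], and the two distances of 20 land in bin 40.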
Upvotes: 1