Assign category or integer to row if value within an interval

I have this problem that I cannot get around. I have this dataframe:

item  distance
0      1         0
1      2         1
2      3         1
3      4         3
4      5         4
5      6         4
6      7         5
7      8         6
8      9         7
9     10         7
10    11         7
11    12         7
12    13         8
13    14         8
14    15        20
15    16        20

and I need to associate each row with an interval. So, I thought about creating "bins" this way:

max_distance = df['distance'].max()
min_distance = df['distance'].min()
number_bins = (round(max_distance)-round(min_distance))/0.5

This means that each interval has length 0.5, which creates 40 bins. But this is where I get stuck. I do not know how to

  1. create these intervals, e.g. (0,0.5], (0.5,1], (1,1.5], (1.5,2], (2,2.5], (2.5,3], ... and give each of them a name 1, 2, 3, ..., 40
  2. associate each df['distance'] with a specific interval number (from 1.)

The result would look like this:

item  distance  bin 
0      1         0   1
1      2         1   2
2      3         1   2
3      4         3   6
4      5         4   6
5      6         4  #and so on
6      7         5
7      8         6
8      9         7
9     10         7
10    11         7
11    12         7
12    13         8
13    14         8
14    15        20
15    16        20

Now, I tried something using pd.cut, but doing so:

bins_df = pd.cut(df['distance'], round(number_bins))
bins_unique = bins_df.unique()

returns intervals with gaps and not enough categories:

[(-0.02, 0.155], (0.93, 1.085], (2.946, 3.101], (3.876, 4.031], (4.961, 5.116], (5.891, 6.047], (6.977, 7.132], (7.907, 8.062], (19.845, 20.0]]
Categories (9, interval[float64]): [(-0.02, 0.155] < (0.93, 1.085] < (2.946, 3.101] < (3.876, 4.031] ... (5.891, 6.047] < (6.977, 7.132] < (7.907, 8.062] < (19.845, 20.0]]

Ideally, I would associate every distance value with a category in [1, number_bins]. Any idea on how I could achieve my desired output would be greatly appreciated.

Upvotes: 0

Views: 407

Answers (1)

Cimbali

Reputation: 11395

You seemed on the right track with the 2 steps you specified. Here’s how I would carry them out:

  1. generate bounds on the bins
import numpy as np
bounds = np.arange(df['distance'].min(), df['distance'].max() + .5, .5)
bounds
  • np.arange is mostly like range(), but you can specify floating-point bounds and step.
  • The +.5 ensures you get the final bound.

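This gives you the following (the output shown was produced on values running from 1 to 16; on the distance column as posted, which runs from 0 to 20, the bounds would run from 0 to 20 in the same 0.5 steps):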

array([ 1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,  5.5,  6. ,
        6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5, 10. , 10.5, 11. , 11.5,
       12. , 12.5, 13. , 13.5, 14. , 14.5, 15. , 15.5, 16. ])
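
As a self-contained check, here is a minimal sketch that applies the same np.arange call to the distance column as posted in the question. The DataFrame is rebuilt by hand from the question's table, and the name step is only introduced here for illustration:

```python
import numpy as np
import pandas as pd

# The data from the question, rebuilt by hand.
df = pd.DataFrame({
    "item": range(1, 17),
    "distance": [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20],
})

step = 0.5  # bin width asked for in the question (illustrative name)

# Adding one step to the maximum keeps the last edge, because
# np.arange excludes its stop value.
bounds = np.arange(df["distance"].min(), df["distance"].max() + step, step)

print(bounds[:4], bounds[-1], len(bounds))
```

With this data the edges run from 0 to 20, i.e. 41 edges delimiting 40 bins, which matches the bin count computed in the question.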
  2. use pd.cut
dist_bins = pd.cut(df['distance'], bins=bounds, include_lowest=True)
dist_bins

This uses the fact that you can specify the bins manually, see the doc:

bins : int, sequence of scalars, or IntervalIndex

The criteria to bin by.

  • [...]
  • sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.

Which returns:

0     (0.999, 1.5]
1       (1.5, 2.0]
2       (2.5, 3.0]
3       (3.5, 4.0]
4       (4.5, 5.0]
5       (5.5, 6.0]
6       (6.5, 7.0]
7       (7.5, 8.0]
8       (8.5, 9.0]
9      (9.5, 10.0]
10    (10.5, 11.0]
11    (11.5, 12.0]
12    (12.5, 13.0]
13    (13.5, 14.0]
14    (14.5, 15.0]
15    (15.5, 16.0]
Name: distance, dtype: category
Categories (30, interval[float64]): [(0.999, 1.5] < (1.5, 2.0] < (2.0, 2.5] < (2.5, 3.0] < ... <
                                     (14.0, 14.5] < (14.5, 15.0] < (15.0, 15.5] < (15.5, 16.0]]

Note that, as per your specification of the bins (open on the left), the lowest value (1 in the output above) would not fall in any bin, which is why I used include_lowest=True and why the first bin looks like (0.999, 1.5] (which is basically [1, 1.5]). If you don’t want this, you need to start the bins below your min().

You get a (sorted) category dtype column (pd.Series) as expected.
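
For the alternative mentioned above (starting the bins below the minimum instead of using include_lowest=True), a minimal sketch could look like this. It assumes the distance values from the question; the names step, low_bounds and low_bins are made up for the example:

```python
import numpy as np
import pandas as pd

# Distance values from the question, rebuilt by hand.
distance = pd.Series(
    [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20], name="distance"
)

step = 0.5

# Start one step below the minimum so that the smallest value falls
# inside the right-closed interval (min - step, min].
low_bounds = np.arange(distance.min() - step, distance.max() + step, step)

# No include_lowest here: every value already lies strictly above the first edge.
low_bins = pd.cut(distance, bins=low_bounds)

print(low_bins.cat.categories[0])   # first interval
print(low_bins.isna().sum())        # number of values left without a bin
```

The trade-off is one extra bin at the bottom, so any numbering derived from these bins shifts by one compared to the include_lowest=True variant.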

If you want the list of the 30 categories that were created, you can access them with the .cat accessor:

dist_bins.cat.categories

This returns an IntervalIndex:

IntervalIndex([(0.999, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5] ... (13.5, 14.0], (14.0, 14.5], (14.5, 15.0], (15.0, 15.5], (15.5, 16.0]],
              closed='right',
              dtype='interval[float64]')

As with every index you can access the list of values:

>>> dist_bins.cat.categories.to_list()
[Interval(0.999, 1.5, closed='right'), Interval(1.5, 2.0, closed='right'), Interval(2.0, 2.5, closed='right'), Interval(2.5, 3.0, closed='right'), Interval(3.0, 3.5, closed='right'), Interval(3.5, 4.0, closed='right'), Interval(4.0, 4.5, closed='right'), Interval(4.5, 5.0, closed='right'), Interval(5.0, 5.5, closed='right'), Interval(5.5, 6.0, closed='right'), Interval(6.0, 6.5, closed='right'), Interval(6.5, 7.0, closed='right'), Interval(7.0, 7.5, closed='right'), Interval(7.5, 8.0, closed='right'), Interval(8.0, 8.5, closed='right'), Interval(8.5, 9.0, closed='right'), Interval(9.0, 9.5, closed='right'), Interval(9.5, 10.0, closed='right'), Interval(10.0, 10.5, closed='right'), Interval(10.5, 11.0, closed='right'), Interval(11.0, 11.5, closed='right'), Interval(11.5, 12.0, closed='right'), Interval(12.0, 12.5, closed='right'), Interval(12.5, 13.0, closed='right'), Interval(13.0, 13.5, closed='right'), Interval(13.5, 14.0, closed='right'), Interval(14.0, 14.5, closed='right'), Interval(14.5, 15.0, closed='right'), Interval(15.0, 15.5, closed='right'), Interval(15.5, 16.0, closed='right')]
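
The question also asks for an integer name (1, 2, 3, ...) per interval rather than the interval itself. One possible way to get that, sketched here on the distance values from the question, is to take the position of each row's category with .cat.codes and add 1. The names bin_number and bin_number_alt are made up for the example:

```python
import numpy as np
import pandas as pd

# Distance values from the question, rebuilt by hand.
distance = pd.Series(
    [0, 1, 1, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 8, 20, 20], name="distance"
)
bounds = np.arange(distance.min(), distance.max() + 0.5, 0.5)

dist_bins = pd.cut(distance, bins=bounds, include_lowest=True)

# .cat.codes is the 0-based position of each row's interval in the
# ordered category list; adding 1 numbers the bins from 1.
bin_number = dist_bins.cat.codes + 1

# Shortcut: labels=False makes pd.cut return the 0-based bin index directly.
bin_number_alt = pd.cut(distance, bins=bounds, include_lowest=True, labels=False) + 1

print(bin_number.tolist())
# [1, 2, 2, 6, 8, 8, 10, 12, 14, 14, 14, 14, 16, 16, 40, 40]
```

Be aware that a value outside the bounds gets code -1 from .cat.codes (so 0 after adding 1), while the labels=False variant yields NaN for it.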

Upvotes: 1
