Reputation: 13
I am doing some data analysis with pandas and am struggling to find a nice, clean way of grouping a column of numbers into ranges. I have a data frame with a column of floats, but I am not interested in the exact values, only in the rough range each one falls into. Ultimately I want to run a pivot and count how many values fall in each range. Ideally I would create a new column in my data frame that converts each float into its range: say df['number'] = 3.5, then df['range'] = 0-10.
The ranges should be 0-10, 10-20, ..., >100.
This may sound very arbitrary, but I've been struggling to find an answer on this. Many thanks
Upvotes: 0
Views: 2286
Reputation: 28956
Pandas has a cut function for this:
In [12]: s = pd.Series(np.random.uniform(0, 110, 100))
In [13]: s
Out[13]:
0 2.652461
1 46.536276
2 6.455352
3 6.075963
4 40.013378
...
95 39.775493
96 99.688307
97 41.064469
98 91.401904
99 60.580600
dtype: float64
In [14]: cuts = np.arange(0, 101, 10)
In [15]: pd.cut(s, cuts)
Out[15]:
0 (0, 10]
1 (40, 50]
2 (0, 10]
3 (0, 10]
4 (40, 50]
...
95 (30, 40]
96 (90, 100]
97 (40, 50]
98 (90, 100]
99 (60, 70]
dtype: category
Categories (10, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40] ... (60, 70] < (70, 80] < (80, 90] < (90, 100]]
See the docs for controlling what happens with endpoints.
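As a rough sketch of what that endpoint control can look like for the ranges in the question: passing right=False closes each bin on the left, and appending np.inf to the edges gives a single open-ended top bin. The sample values and the label strings are made up for illustration, and note that with these edges a value of exactly 100 lands in the ">100" bin.

```python
import numpy as np
import pandas as pd

# Illustrative values only, chosen to hit the bin edges.
s = pd.Series([0.0, 3.5, 10.0, 99.9, 100.0, 250.0])

# Edges 0, 10, ..., 100, plus +inf so everything from 100 up shares one bin.
edges = list(range(0, 101, 10)) + [np.inf]

# One label per bin: "0-10", "10-20", ..., "90-100", then ">100".
labels = [f"{lo}-{lo + 10}" for lo in range(0, 100, 10)] + [">100"]

# right=False makes each bin closed on the left: [0, 10), [10, 20), ...
binned = pd.cut(s, bins=edges, right=False, labels=labels)
print(binned.tolist())
# ['0-10', '0-10', '10-20', '90-100', '>100', '>100']
```

Without right=False the default bins are open on the left, so a value of exactly 0 would fall outside the first bin and come back as NaN unless include_lowest=True is passed.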
Note that in newer pandas releases (0.20 and later) the categories are stored as an IntervalIndex of real Interval objects rather than string labels, which makes the result even nicer to work with.
To get your counts per interval, use the value_counts method:
In [17]: pd.cut(s, cuts).value_counts()
Out[17]:
(30, 40] 15
(40, 50] 13
(50, 60] 12
(60, 70] 10
(0, 10] 10
(90, 100] 8
(70, 80] 8
(80, 90] 7
(10, 20] 6
(20, 30] 3
dtype: int64
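To tie this back to the question, here is a minimal sketch that stores the bins as a new column on a data frame and then counts them in bin order rather than by frequency. The column names number and range and the sample values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the asker's data.
df = pd.DataFrame({"number": [3.5, 4.2, 10.5, 19.5, 20.3, 24.2, 150.0]})

# Bins [0, 10), [10, 20), ..., [90, 100), [100, inf)
edges = list(range(0, 101, 10)) + [np.inf]
df["range"] = pd.cut(df["number"], bins=edges, right=False)

# value_counts orders by count; sort_index puts the bins back in interval order.
counts = df["range"].value_counts().sort_index()
print(counts)
```

Because the column is categorical, bins with no values still show up with a count of 0, which is usually what you want in a summary table.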
Upvotes: 1
Reputation: 2318
Using the properties of integer division should help. Because you want ranges in units of 10, dividing a number by 10 (13.5 / 10 == 1.35), converting it to an integer (int(1.35) == 1), and then multiplying by 10 (1 * 10 == 10) will convert the number to the low end of the range it falls into. This might need some refinement (especially for negative numbers, where int() truncates toward zero while np.floor rounds down), but you could try something like:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'vals': [3.5, 4.2, 10.5, 19.5, 20.3, 24.2]})
>>> df
vals
0 3.5
1 4.2
2 10.5
3 19.5
4 20.3
5 24.2
>>> df['range_start'] = np.floor(df['vals'] / 10) * 10
>>> df
vals range_start
0 3.5 0
1 4.2 0
2 10.5 10
3 19.5 10
4 20.3 20
5 24.2 20
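If you want the text labels from the question rather than just the lower bound, one possible extension of the same floor trick is sketched below. The step and cap variables, the extra 250.0 row, and the label format are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"vals": [3.5, 4.2, 10.5, 19.5, 20.3, 24.2, 250.0]})

step, cap = 10, 100  # bin width and the start of the open-ended top range

# Lower bound of each value's bin, as integers so the labels read "0-10".
start = (np.floor(df["vals"] / step) * step).astype(int)

# Build "lo-hi" labels, then overwrite the rows at or above the cap.
df["range"] = start.astype(str) + "-" + (start + step).astype(str)
df.loc[df["vals"] >= cap, "range"] = f">{cap}"

print(df["range"].tolist())
# ['0-10', '0-10', '10-20', '10-20', '20-30', '20-30', '>100']
```

np.floor rounds toward negative infinity, so a value such as -3.5 would get the label "-10-0"; handle negatives separately if that is not what you want.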
Upvotes: 0
Reputation: 17771
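def get_range_for(x, start=0, stop=100, step=10):
    # Values below the first bin go into an open-ended lower range.
    if x < start:
        return (float('-inf'), start)
    # Values at or above the last edge go into an open-ended upper range.
    if x >= stop:
        return (stop, float('inf'))
    # Offset by start so the bins stay aligned when start != 0.
    left = start + step * ((x - start) // step)
    right = left + step
    return (left, right)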
Examples:
>>> get_range_for(3.5)
(0.0, 10.0)
>>> get_range_for(27.3)
(20.0, 30.0)
>>> get_range_for(75.6)
(70.0, 80.0)
Corner cases:
>>> get_range_for(-100)
(-inf, 0)
>>> get_range_for(1234)
(100, inf)
>>> get_range_for(0)
(0, 10)
>>> get_range_for(10)
(10, 20)
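To use this with pandas, the function can be applied to a column element by element. The sketch below repeats the function so that it runs on its own; label_for is a hypothetical helper that turns the tuple into a text label, and the column names are assumptions.

```python
import pandas as pd

def get_range_for(x, start=0, stop=100, step=10):
    if x < start:
        return (float('-inf'), start)
    if x >= stop:
        return (stop, float('inf'))
    left = start + step * ((x - start) // step)
    return (left, left + step)

def label_for(x):
    # Hypothetical helper: format the (left, right) tuple as "0-10", ">100", "<0".
    left, right = get_range_for(x)
    if right == float('inf'):
        return f">{left:g}"
    if left == float('-inf'):
        return f"<{right:g}"
    return f"{left:g}-{right:g}"

df = pd.DataFrame({"number": [3.5, 27.3, 75.6, 1234.0]})
df["range"] = df["number"].apply(label_for)
print(df["range"].tolist())
# ['0-10', '20-30', '70-80', '>100']
```

apply calls the Python function once per row, so for large frames a vectorized approach such as pd.cut will generally be faster.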
Upvotes: 0