Reputation: 79
I have a big DataFrame with a lot of columns, such as age, name, sex, etc.
I want to make a new column with age groups 1-10, 11-20, 21-30, ..., 71-80.
I tried:
ranges = [1, 10, 20, 30, 40, 50, 60, 70, 80]
df.age.groupby(pd.cut(df.age, ranges)).count()
and the result is
age
(1, 10] 64
(10, 20] 162
(20, 30] 361
(30, 40] 210
(40, 50] 132
(50, 60] 62
(60, 70] 27
(70, 80] 6
Name: age, dtype: int64
This is almost what I wanted, but the groups are incorrect: I want 1-10 and then 11-20, not 1-10 and then 10-20. Can anybody help me solve this problem?
Upvotes: 1
Views: 322
Reputation: 862611
First, I think it is worth repeating the explanation from @samthegolden's comment:
(10, 20] means "between 10 and 20, excluding 10 and including 20": the round parenthesis marks an excluded endpoint and the square bracket an included one.
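To make the interval notation concrete, here is a small sketch with made-up ages (the values are illustrative, not from the question's data). It assumes the default `right=True` behaviour of `pd.cut`, under which 10 and 11 already land in different bins, so only the displayed labels are misleading.

```python
import pandas as pd

# Hypothetical sample values chosen to sit on the bin edges.
ages = pd.Series([1, 10, 11, 20, 21])
ranges = [1, 10, 20, 30]

# With the default right=True each bin is (left, right]:
# 10 falls in (1, 10] and 11 falls in (10, 20].
# The lowest edge itself (1) is excluded and becomes NaN.
binned = pd.cut(ages, ranges)
print(binned.tolist())
```

Note that the first value, 1, gets no bin at all under these defaults, which matters if the first group is meant to be 1-10.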
But you can get the group names you want by passing the labels parameter, built from ranges with zip in a list comprehension:
import numpy as np
import pandas as pd
np.random.seed(2020)
df = pd.DataFrame({'age':np.random.randint(1, 80, size=100)})
ranges = [1, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ['{}-{}'.format(i + 1, j) for i, j in zip(ranges[:-1], ranges[1:])]
labels[0] = '{}-{}'.format(ranges[0], ranges[1])
print(labels)
['1-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80']
s = df.age.groupby(pd.cut(df.age, ranges, labels=labels)).count()
print(s)
age
1-10 14
11-20 10
21-30 15
31-40 12
41-50 7
51-60 11
61-70 18
71-80 12
Name: age, dtype: int64
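Since the question asked for a new column rather than only counts, the same idea can be sketched as below. This is a minimal example under stated assumptions: the column name `age_group` and the sample ages are made up, and `include_lowest=True` is an addition that is not in the code above. Without it, an age of exactly 1 falls outside the first bin `(1, 10]` and is left as NaN, even though that bin is labelled 1-10.

```python
import pandas as pd

# Hypothetical sample data; the real DataFrame would have more columns.
df = pd.DataFrame({'age': [1, 5, 10, 11, 20, 35, 80]})

ranges = [1, 10, 20, 30, 40, 50, 60, 70, 80]
# First label starts at the lowest edge; the rest start one above the left edge.
labels = ['1-10'] + ['{}-{}'.format(lo + 1, hi)
                     for lo, hi in zip(ranges[1:-1], ranges[2:])]

# include_lowest=True closes the first bin on the left, so age 1 is kept.
df['age_group'] = pd.cut(df['age'], ranges, labels=labels, include_lowest=True)

# Counts per group, in bin order rather than sorted by frequency.
counts = df['age_group'].value_counts(sort=False)
print(df)
print(counts)
```

Ages below 1 or above 80 would still become NaN with these edges, so the list of ranges may need extending for other data.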
Upvotes: 1