Reputation: 977
I have a DataFrame, say df, with a column 'Age':
>>> df['Age']
0 22
1 38
2 26
3 35
4 35
5 -1
6 54
I want to group these ages and create a new column, something like this:
If age >= 0 & age < 2 then AgeGroup = Infant
If age >= 2 & age < 4 then AgeGroup = Toddler
If age >= 4 & age < 13 then AgeGroup = Kid
If age >= 13 & age < 20 then AgeGroup = Teen
and so on .....
How can I achieve this using the pandas library?
I tried something like this:
X_train_data['AgeGroup'][ X_train_data.Age < 13 ] = 'Kid'
X_train_data['AgeGroup'][ X_train_data.Age < 3 ] = 'Toddler'
X_train_data['AgeGroup'][ X_train_data.Age < 1 ] = 'Infant'
but doing this I get the following warning:
/Users/Anand/miniconda3/envs/learn/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
This is separate from the ipykernel package so we can avoid doing imports until
/Users/Anand/miniconda3/envs/learn/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
How can I avoid this warning and do this in a better way?
Upvotes: 11
Views: 58599
Reputation: 862601
Use pandas.cut with the parameter right=False so that the rightmost edge of each bin is excluded:
X_train_data = pd.DataFrame({'Age':[0,2,4,13,35,-1,54]})
bins= [0,2,4,13,20,110]
labels = ['Infant','Toddler','Kid','Teen','Adult']
X_train_data['AgeGroup'] = pd.cut(X_train_data['Age'], bins=bins, labels=labels, right=False)
print (X_train_data)
Age AgeGroup
0 0 Infant
1 2 Toddler
2 4 Kid
3 13 Teen
4 35 Adult
5 -1 NaN
6 54 Adult
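To make the effect of right=False concrete, here is a small sketch that bins the same ages both ways. The variable names (ages, left_closed, right_closed) are only illustrative and are not from the original answer.

```python
import pandas as pd

ages = pd.Series([0, 2, 4, 13, 35, -1, 54], name="Age")
bins = [0, 2, 4, 13, 20, 110]
labels = ["Infant", "Toddler", "Kid", "Teen", "Adult"]

# right=False: intervals are [0, 2), [2, 4), ... so age 2 becomes 'Toddler'
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# default right=True: intervals are (0, 2], (2, 4], ... so age 2 becomes
# 'Infant' and age 0 falls outside every bin (NaN)
right_closed = pd.cut(ages, bins=bins, labels=labels)

comparison = pd.DataFrame(
    {"Age": ages, "right=False": left_closed, "right=True": right_closed}
)
print(comparison)
```

Only the right=False version matches the "age >= 0 and age < 2" style of rule in the question.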
Finally, to replace the missing values, use add_categories with fillna:
X_train_data['AgeGroup'] = X_train_data['AgeGroup'].cat.add_categories('unknown').fillna('unknown')
print (X_train_data)
Age AgeGroup
0 0 Infant
1 2 Toddler
2 4 Kid
3 13 Teen
4 35 Adult
5 -1 unknown
6 54 Adult
Alternatively, add -1 as the lowest bin edge so that the -1 values get their own 'unknown' label:
bins= [-1,0,2,4,13,20, 110]
labels = ['unknown','Infant','Toddler','Kid','Teen', 'Adult']
X_train_data['AgeGroup'] = pd.cut(X_train_data['Age'], bins=bins, labels=labels, right=False)
print (X_train_data)
Age AgeGroup
0 0 Infant
1 2 Toddler
2 4 Kid
3 13 Teen
4 35 Adult
5 -1 unknown
6 54 Adult
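Note that with -1 as the lowest edge, anything below -1 would still come out as NaN. If, and this is an assumption beyond the question, every negative age should count as 'unknown', one possible sketch is to use negative infinity as the lowest edge. The extra -5 value in the sample data is hypothetical.

```python
import pandas as pd

# -5 is a hypothetical extra value to show that any negative age is caught
X_train_data = pd.DataFrame({"Age": [0, 2, 4, 13, 35, -1, 54, -5]})

# Lowest bin is [-inf, 0), so every negative age lands in 'unknown'
bins = [float("-inf"), 0, 2, 4, 13, 20, 110]
labels = ["unknown", "Infant", "Toddler", "Kid", "Teen", "Adult"]

X_train_data["AgeGroup"] = pd.cut(
    X_train_data["Age"], bins=bins, labels=labels, right=False
)
print(X_train_data)
```

Ages of 110 or more would still be NaN here, since 110 is the excluded upper edge of the last bin.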
Upvotes: 34
Reputation: 3926
Just use:
X_train_data.loc[(X_train_data.Age < 13), 'AgeGroup'] = 'Kid'
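The line above only covers one group. A fuller sketch of the same .loc approach, using the thresholds from the question, might look like the following; treating ages of 20 and above as 'Adult' and negative ages as 'unknown' is an assumption borrowed from the other answer. Selecting rows and the column in a single .loc call avoids the chained indexing that triggers SettingWithCopyWarning.

```python
import pandas as pd

X_train_data = pd.DataFrame({"Age": [0, 2, 4, 13, 35, -1, 54]})
age = X_train_data["Age"]

# Start with a default, then overwrite each range in one .loc call.
X_train_data["AgeGroup"] = "unknown"

# Each comparison needs its own parentheses because & binds more
# tightly than >= and <.
X_train_data.loc[(age >= 0) & (age < 2), "AgeGroup"] = "Infant"
X_train_data.loc[(age >= 2) & (age < 4), "AgeGroup"] = "Toddler"
X_train_data.loc[(age >= 4) & (age < 13), "AgeGroup"] = "Kid"
X_train_data.loc[(age >= 13) & (age < 20), "AgeGroup"] = "Teen"
X_train_data.loc[age >= 20, "AgeGroup"] = "Adult"  # assumed upper group

print(X_train_data)
```

This works, but it needs one line per group, so pandas.cut is usually the more compact choice once there are more than a couple of ranges.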
Upvotes: 3