Reputation: 191
I'm using pandas (Python 2.7) to evaluate a survey, using (in part) the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
First, read the .csv:
df = pd.read_csv("data_project_638595_2017_05_23.csv", sep=';',usecols=range(6,82) + range(92,112))
Rename the columns (this is an example):
df.rename(columns={"v_27" : "age"}, inplace=True)
Set the data types for all columns (this is an example):
df["age"] = df["age"].astype("category")
In the survey, participants gave their age as a category rather than as a number. The column therefore looks like this, where 2.0 = "20-29 years old":
df.age
age
...
333 2.0
336 2.0
338 2.0
Name: age, dtype: category
Categories (5, float64): [1.0, 2.0, 3.0, 4.0, 5.0]
And its counts look like this:
df.age.value_counts()
2.0 178
3.0 29
5.0 3
4.0 2
1.0 2
Name: age, dtype: int64
What I would now like to do is rename the categories and establish the following ordered set (this also means that "60+" would have a count of 0):
['0-19', '20-29', '30-39', '40-49', '50-59', '60+']
I've tried several methods (e.g. rename_categories) but I just can't get it to work like it should.
What's a feasible solution for this? Thanks in advance!
Upvotes: 2
Views: 6684
Reputation: 210842
Use the pd.cut method:
df['new'] = pd.cut(df.age,
bins=[0, 19, 29, 39, 49, 59, 999],
labels=['0-19', '20-29', '30-39', '40-49', '50-59', '60+'],
include_lowest=True)
Demo:
In [103]: df = pd.DataFrame(np.random.randint(100, size=(10)), columns=['age'])
In [104]: df
Out[104]:
age
0 10
1 64
2 84
3 14
4 4
5 31
6 98
7 22
8 49
9 50
In [105]: df['new'] = pd.cut(df.age,
...: bins=[0, 19, 29, 39, 49, 59, 999],
...: labels=['0-19', '20-29', '30-39', '40-49', '50-59', '60+'],
...: include_lowest=True)
In [106]: df
Out[106]:
age new
0 10 0-19
1 64 60+
2 84 60+
3 14 0-19
4 4 0-19
5 31 30-39
6 98 60+
7 22 20-29
8 49 40-49
9 50 50-59
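As a quick check on the demo above: the column returned by pd.cut is an ordered categorical, so bins that nobody falls into should still appear in the counts. A minimal sketch, using made-up ages (not the asker's data) chosen so that the '60+' bin stays empty:

```python
import pandas as pd

# Hypothetical ages, picked so that no value lands in the '60+' bin.
ages = pd.Series([10, 14, 4, 31, 22, 49, 50], name="age")

labels = ['0-19', '20-29', '30-39', '40-49', '50-59', '60+']
binned = pd.cut(ages,
                bins=[0, 19, 29, 39, 49, 59, 999],
                labels=labels,
                include_lowest=True)

# pd.cut returns an ordered categorical by default.
print(binned.cat.ordered)

# sort=False avoids sorting by frequency; unused categories
# are reported with a count of 0.
counts = binned.value_counts(sort=False)
print(counts)
```

Without sort=False, value_counts orders the rows by frequency, which hides the natural age order.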
UPDATE: if the column holds category codes rather than raw ages, map the codes to labels with a dict.
Mapping:
In [20]: d
Out[20]: {0: '0-19', 1: '20-29', 2: '30-39', 3: '40-49', 4: '50-59', 5: '60+'}
Source DF:
In [21]: df
Out[21]:
age
0 0
1 3
2 2
3 3
4 4
5 2
6 0
7 3
8 2
9 4
Mapped age:
In [22]: df.age.map(d)
Out[22]:
0 0-19
1 40-49
2 30-39
3 40-49
4 50-59
5 30-39
6 0-19
7 40-49
8 30-39
9 50-59
Name: age, dtype: object
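The mapped result above has dtype object, so it is neither ordered nor aware of the unused '60+' label. One way to get both is to wrap the mapped values in pd.Categorical with the full label list. This is a sketch under assumptions: the codes are taken to run from 1.0 ('0-19') upward, in line with the question's note that 2.0 means "20-29 years old", and the sample values and the names code_to_label and age_group are made up for illustration.

```python
import pandas as pd

labels = ['0-19', '20-29', '30-39', '40-49', '50-59', '60+']

# Assumed code-to-label mapping: 1.0 -> '0-19', 2.0 -> '20-29', ..., 6.0 -> '60+'.
code_to_label = {float(i): label for i, label in enumerate(labels, start=1)}

# Hypothetical sample of survey codes (not the asker's real data).
df = pd.DataFrame({"age": [2.0, 2.0, 3.0, 5.0, 1.0, 4.0, 2.0]})

# Map codes to labels, then declare the full, ordered category set so that
# '60+' exists as a category even though no respondent chose it.
df["age_group"] = pd.Categorical(df["age"].map(code_to_label),
                                 categories=labels,
                                 ordered=True)

# Counts in category order, including the empty '60+' group.
print(df["age_group"].value_counts(sort=False))
```

If the column is already of category dtype with five categories, chaining .cat.rename_categories(labels[:5]) and .cat.set_categories(labels, ordered=True) may get to the same place without the dict.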
Upvotes: 1