cian

Reputation: 191

Create / Rename Categories with Pandas

I'm using pandas (Python 2.7) to evaluate a survey, using (in part) the following code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

First, read the .csv:

df = pd.read_csv("data_project_638595_2017_05_23.csv", sep=';', usecols=list(range(6, 82)) + list(range(92, 112)))

Rename the columns (this is an example):

df.rename(columns={"v_27" : "age"}, inplace=True)

Set the data types for all columns (this is an example):

df["age"] = df["age"].astype("category")

In the survey, participants were asked for their age in categories. Thus age currently looks like this, where 2.0 = "20-29 years old":

df.age

       age
...
333    2.0
336    2.0
338    2.0
Name: age, dtype: category
Categories (5, float64): [1.0, 2.0, 3.0, 4.0, 5.0]

And its count like this:

df.age.value_counts()

2.0    178
3.0     29
5.0      3
4.0      2
1.0      2
Name: age, dtype: int64

What I would now like to do is establish and rename the following categories. This also means that "60+" has a count of 0, and the categories should be ordered:

['0-19', '20-29', '30-39', '40-49', '50-59', '60+']

I've tried several methods (e.g. rename_categories) but I just can't get it to work like it should.

What's a feasible solution for this? Thanks in advance!

Upvotes: 2

Views: 6684

Answers (1)

MaxU - stand with Ukraine

Reputation: 210842

Use the pd.cut method to bin the ages into labelled intervals:

df['new'] = pd.cut(df.age, 
                   bins=[0, 19, 29, 39, 49, 59, 999], 
                   labels=['0-19', '20-29', '30-39', '40-49', '50-59', '60+'],
                   include_lowest=True)

Demo:

In [103]: df = pd.DataFrame(np.random.randint(100, size=(10)), columns=['age'])

In [104]: df
Out[104]:
   age
0   10
1   64
2   84
3   14
4    4
5   31
6   98
7   22
8   49
9   50

In [105]: df['new'] = pd.cut(df.age,
     ...:                    bins=[0, 19, 29, 39, 49, 59, 999],
     ...:                    labels=['0-19', '20-29', '30-39', '40-49', '50-59', '60+'],
     ...:                    include_lowest=True)

In [106]: df
Out[106]:
   age    new
0   10   0-19
1   64    60+
2   84    60+
3   14   0-19
4    4   0-19
5   31  30-39
6   98    60+
7   22  20-29
8   49  40-49
9   50  50-59
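Regarding the two requirements in the question (ordered categories and a "60+" category with zero counts), here is a minimal sketch of how to check that pd.cut delivers both. The sample ages are made up for illustration and deliberately contain nobody aged 60 or above:

```python
import pandas as pd

# Hypothetical raw ages (not from the survey); none falls into the '60+' bin
ages = pd.Series([10, 14, 4, 31, 22, 49, 50], name="age")
labels = ['0-19', '20-29', '30-39', '40-49', '50-59', '60+']

binned = pd.cut(ages,
                bins=[0, 19, 29, 39, 49, 59, 999],
                labels=labels,
                include_lowest=True)

# pd.cut returns an ordered categorical that keeps every label as a category
print(binned.cat.ordered)
print(list(binned.cat.categories))

# Counts in category order; unused categories are reported with a count of 0
counts = binned.value_counts().sort_index()
print(counts)
```

Because the result is a categorical, empty categories such as '60+' should still show up in value_counts(), which is usually what you want for a survey table.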

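UPDATE: if the column already holds category codes rather than raw ages, map the codes to labels with a dictionary.

Mapping: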

In [20]: d
Out[20]: {0: '0-19', 1: '20-29', 2: '30-39', 3: '40-49', 4: '50-59', 5: '60+'}

Source DF:

In [21]: df
Out[21]:
   age
0    0
1    3
2    2
3    3
4    4
5    2
6    0
7    3
8    2
9    4

Mapped age:

In [22]: df.age.map(d)
Out[22]:
0     0-19
1    40-49
2    30-39
3    40-49
4    50-59
5    30-39
6     0-19
7    40-49
8    30-39
9    50-59
Name: age, dtype: object
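Note that map returns an object column, so the ordering and the empty "60+" category are lost. A possible sketch for turning the mapped labels into an ordered categorical follows. It assumes the survey codes 1.0 to 6.0 correspond to the labels in order (the question only confirms 2.0 = "20-29"), and the sample codes are invented for illustration:

```python
import pandas as pd

labels = ['0-19', '20-29', '30-39', '40-49', '50-59', '60+']

# Assumption: codes 1.0 .. 6.0 map to the labels in order
# (only 2.0 -> '20-29' is stated in the question)
code_to_label = {float(i): label for i, label in enumerate(labels, start=1)}

# Hypothetical stand-in for the survey column; code 6.0 never occurs
age_codes = pd.Series([2.0, 2.0, 3.0, 5.0, 1.0, 4.0, 2.0], name="age")

# Map codes to labels, then declare the full, ordered set of categories
age = pd.Series(
    pd.Categorical(age_codes.map(code_to_label), categories=labels, ordered=True),
    index=age_codes.index,
    name="age",
)

# Counts in category order, including the unused '60+' category
counts = age.value_counts().sort_index()
print(counts)

# Ordered categories allow comparisons such as "younger than 30-39"
print((age < '30-39').sum())
```

If the column is already of category dtype, chaining .cat.rename_categories(...), .cat.add_categories(['60+']) and .cat.as_ordered() should give an equivalent result, though I have not tested that against the original data.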

Upvotes: 1
