buhtz
buhtz

Reputation: 12202

Specifify ordered cateogry in pandas read_csv()

When reading a CSV file with pandas I can specify the column types via the dtypes parameter.

I can specify category like this.

#!/usr/bin/env python3
import io
import csv
import pandas as pd

# there is a missing/empty value in 4th row
csv_data = """FOO;BAR\n
1;20204\n
5;20183\n
5;20182\n
4;20212\n"""

df = pd.read_csv(io.StringIO(csv_data),
                 keep_default_na=False,
                 header=0,
                 sep=';',
                 dtype={'BAR': 'category'})

print(df.BAR)

print(df.BAR.cat.as_ordered())

BAR is a category but not ordered. In my case it is easy to determine the ordering automatic. The values can be treated as strings or numbers.

Can I do something like dtype={'BAR': 'ordered category'}?

EDIT: Because of the answers and comments that are out of my questions scope I would like to emphasise that I want to define the colum as an ordered categorial while read_csv(). I know how to create a ordered categorial but not while reading a CSV file.

Upvotes: 1

Views: 262

Answers (4)

mozway
mozway

Reputation: 261850

You can specify a custom CategoricalDtype, but this means that you must know the specific categories in advance.

MyCat = pd.CategoricalDtype(['20182', '20183', '20204', '20212'], ordered=True)

df = pd.read_csv(io.StringIO(csv_data),
                 keep_default_na=False,
                 header=0,
                 sep=';',
                 dtype={'BAR': MyCat})

df['BAR']

0    20204
1    20183
2    20182
3    20212
Name: BAR, dtype: category
Categories (4, object): ['20182' < '20183' < '20204' < '20212']

In your example this doesn't really make sense as you already have lexicographically orderable value. Thus specifying the values there is not practical. Considering however another example such as pd.CategoricalDtype(['Small', 'Medium', 'Large'], ordered=True), it's easy to see that the order must be specified in the beginning.

Upvotes: 0

buhtz
buhtz

Reputation: 12202

The specific answer to my specific question is, that it is not possible.

While doing read_csv() there is no way to specific an ordered category.

Upvotes: 0

You can use sort_values() like this:

csv_data = """FOO;BAR\n
1;20204\n
5;20003\n
5;20182\n
4;20212\n"""

df = pd.read_csv(io.StringIO(csv_data),
                 keep_default_na=False,
                 header=0,
                 sep=';',
                 dtype={'BAR': 'category'})

print(df.BAR)

print(df.BAR.sort_values(ascending=True))

which returns

0    20204
1    20003
2    20182
3    20212
Name: BAR, dtype: category
Categories (4, object): ['20003', '20182', '20204', '20212']
1    20003
2    20182
0    20204
3    20212
Name: BAR, dtype: category
Categories (4, object): ['20003', '20182', '20204', '20212']

Upvotes: 0

mozway
mozway

Reputation: 261850

You can use pandas.Categorical. It has and ordered parameter.

>>> s = pd.Categorical(['A', 'B', 'C', 'A', 'B', 'C'], categories=['B', 'A', 'C'], ordered=True)
>>> s
['A', 'B', 'C', 'A', 'B', 'C']
Categories (3, object): ['B' < 'A' < 'C']
>>> s.sort_values()
['B', 'B', 'A', 'A', 'C', 'C']
Categories (3, object): ['B' < 'A' < 'C']

Upvotes: 3

Related Questions