Reputation: 12202
When reading a CSV file with pandas I can specify the column types via the dtypes
parameter.
I can specify category
like this.
#!/usr/bin/env python3
import io
import csv
import pandas as pd
# there is a missing/empty value in 4th row
csv_data = """FOO;BAR\n
1;20204\n
5;20183\n
5;20182\n
4;20212\n"""
df = pd.read_csv(io.StringIO(csv_data),
keep_default_na=False,
header=0,
sep=';',
dtype={'BAR': 'category'})
print(df.BAR)
print(df.BAR.cat.as_ordered())
BAR
is a category but not ordered
.
In my case it is easy to determine the ordering automatic. The values can be treated as strings or numbers.
Can I do something like dtype={'BAR': 'ordered category'}
?
EDIT:
Because of the answers and comments that are out of my questions scope I would like to emphasise that I want to define the colum as an ordered categorial while read_csv()
.
I know how to create a ordered categorial but not while reading a CSV file.
Upvotes: 1
Views: 262
Reputation: 261850
You can specify a custom CategoricalDtype
, but this means that you must know the specific categories in advance.
MyCat = pd.CategoricalDtype(['20182', '20183', '20204', '20212'], ordered=True)
df = pd.read_csv(io.StringIO(csv_data),
keep_default_na=False,
header=0,
sep=';',
dtype={'BAR': MyCat})
df['BAR']
0 20204
1 20183
2 20182
3 20212
Name: BAR, dtype: category
Categories (4, object): ['20182' < '20183' < '20204' < '20212']
In your example this doesn't really make sense as you already have lexicographically orderable value. Thus specifying the values there is not practical. Considering however another example such as pd.CategoricalDtype(['Small', 'Medium', 'Large'], ordered=True)
, it's easy to see that the order must be specified in the beginning.
Upvotes: 0
Reputation: 12202
The specific answer to my specific question is, that it is not possible.
While doing read_csv()
there is no way to specific an ordered category.
Upvotes: 0
Reputation: 11504
You can use sort_values()
like this:
csv_data = """FOO;BAR\n
1;20204\n
5;20003\n
5;20182\n
4;20212\n"""
df = pd.read_csv(io.StringIO(csv_data),
keep_default_na=False,
header=0,
sep=';',
dtype={'BAR': 'category'})
print(df.BAR)
print(df.BAR.sort_values(ascending=True))
which returns
0 20204
1 20003
2 20182
3 20212
Name: BAR, dtype: category
Categories (4, object): ['20003', '20182', '20204', '20212']
1 20003
2 20182
0 20204
3 20212
Name: BAR, dtype: category
Categories (4, object): ['20003', '20182', '20204', '20212']
Upvotes: 0
Reputation: 261850
You can use pandas.Categorical
. It has and ordered
parameter.
>>> s = pd.Categorical(['A', 'B', 'C', 'A', 'B', 'C'], categories=['B', 'A', 'C'], ordered=True)
>>> s
['A', 'B', 'C', 'A', 'B', 'C']
Categories (3, object): ['B' < 'A' < 'C']
>>> s.sort_values()
['B', 'B', 'A', 'A', 'C', 'C']
Categories (3, object): ['B' < 'A' < 'C']
Upvotes: 3