Arnold

Reputation: 4840

Why is a category column seen as a column of strings in pandas?

I have a dataset containing ints, floats and strings. I (think I) converted all string columns to categories with the following statements:

for col in X.columns:
    if X[col].dtype == np.object_:
        X[col] = X[col].str.lower().astype('category', copy=False)

However, when I try to feed the data to a random forest model I get the error:

ValueError: could not convert string to float: 'non-compliant by no payment'

The string 'non-compliant by no payment' occurs in the column X['compliance_detail'], and when I request its dtype I get category. The same holds when I ask for its value counts:

In[111]: X['compliance_detail'].dtype
Out[111]: category
In[112]: X['compliance_detail'].value_counts()
Out[112]: 
non-compliant by no payment                        5274
non-compliant by late payment more than 1 month     939
compliant by late payment within 1 month            554
compliant by on-time payment                        374
compliant by early payment                           10
compliant by payment with no scheduled hearing        7
compliant by payment on unknown date                  3
Name: compliance_detail, dtype: int64

Does anybody know what's happening here? Why is a string seen in categorical data? Why is a dtype of int64 listed for this column?

Thank you for your time.

Upvotes: 1

Views: 464

Answers (2)

Arnold

Reputation: 4840

I should have read the docs more carefully ;-) Most statistical procedures in sklearn do not handle categories the way their R counterparts do. In theory a random forest can handle categories without problems, but the sklearn implementation of RandomForestClassifier does not allow it (for now). My mistake was to assume it could, because theory says so and it worked nicely in R. However, the sklearn documentation says the following about the X parameter of the fit function:

X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.

So there is no room for categories, and once they are factorized they are treated as plain numbers. This article explains how categories work in pandas and what their pitfalls are. I advise everybody who wants to use categories to read it, especially those coming from an R background. I hope this aspect will be improved, because in the current situation one cannot make full use of some procedures.
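
To make the float32 conversion issue concrete, here is a minimal sketch that uses only pandas and NumPy (no sklearn call is made). The toy DataFrame and the fine_amount column are made up for illustration; only the compliance_detail name comes from the question. It shows that a category column still fails a float conversion, and that one possible workaround is to swap in the integer codes first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the 'compliance_detail' name comes from the question.
X = pd.DataFrame({
    "fine_amount": [250.0, 100.0, 250.0],
    "compliance_detail": [
        "non-compliant by no payment",
        "compliant by on-time payment",
        "non-compliant by no payment",
    ],
})
X["compliance_detail"] = X["compliance_detail"].astype("category")

# A category column still exposes its string labels, so a float conversion fails.
try:
    np.asarray(X["compliance_detail"], dtype=np.float32)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Replace every category column by its integer codes before building the array.
X_encoded = X.copy()
for col in X_encoded.select_dtypes(include="category").columns:
    X_encoded[col] = X_encoded[col].cat.codes

# All columns are numeric now, so the float32 conversion succeeds.
X_float = X_encoded.to_numpy(dtype=np.float32)
```

Keep in mind that integer codes impose an arbitrary ordering on the labels. If that is a concern, one-hot encoding (for example with pd.get_dummies) is a common alternative.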

Upvotes: 1

cs95

Reputation: 402263

When you convert to category type, the column still displays its original values, but pandas keeps track of the categories internally.

s

0    foo
1    bar
2    foo
3    bar
4    foo
5    bar
6    foo
7    foo
Name: A, dtype: object

s = s.astype('category')
s

0    foo
1    bar
2    foo
3    bar
4    foo
5    bar
6    foo
7    foo
Name: A, dtype: category
Categories (2, object): [bar, foo]
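
As for the int64 in the question's output: that dtype belongs to the result of value_counts(), which is a new Series of counts, not to the column itself. A small sketch, assuming a shortened version of the example series above:

```python
import pandas as pd

# Shortened version of the example series (an assumption for brevity).
s = pd.Series(["foo", "bar", "foo"], name="A").astype("category")

# value_counts() returns a new Series: labels in the index, counts as values.
counts = s.value_counts()

print(s.dtype)       # category
print(counts.dtype)  # int64

# The individual values of the categorical column are still plain strings.
first_value = s.iloc[0]
```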

If you want integer categories, you've a few options:

Option 1
cat.codes

s.cat.codes
0    1
1    0
2    1
3    0
4    1
5    0
6    1
7    1
dtype: int8

Option 2
pd.factorize (astype not required)

pd.factorize(s)[0]
array([0, 1, 0, 1, 0, 1, 0, 0])
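
Note that the two options number the labels differently: cat.codes follows the order of the categories (sorted here, so bar is 0), while pd.factorize numbers labels in order of first appearance (so foo is 0). A minimal sketch that rebuilds the example series and keeps the lookup needed to map codes back to labels (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(
    ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], name="A"
).astype("category")

# Option 1: codes index into s.cat.categories.
codes_from_cat = s.cat.codes
code_to_label = dict(enumerate(s.cat.categories))

# Option 2: factorize returns the codes together with the unique labels,
# listed in order of first appearance.
codes_from_factorize, uniques = pd.factorize(s)

# Either lookup recovers the original labels.
labels_from_cat = [code_to_label[c] for c in codes_from_cat]
labels_from_factorize = [uniques[c] for c in codes_from_factorize]
```

Whichever option you pick, keep the lookup around if you need to interpret the model's output in terms of the original labels later.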

Upvotes: 1
