Reputation: 4840
I have a dataset containing ints, floats and strings. I (think I) converted all strings to categories with the following statements:
for col in list(X):
    if X[col].dtype == np.object_:  # dtype('object')
        X[col] = X[col].str.lower().astype('category', copy=False)
However, when I feed the data into a random forest model I get the error:
ValueError: could not convert string to float: 'non-compliant by no payment'
The string 'non-compliant by no payment' occurs in a column named X['compliance_detail'], and when I request its dtype I get category. When I ask for its values:
In[111]: X['compliance_detail'].dtype
Out[111]: category
In[112]: X['compliance_detail'].value_counts()
Out[112]:
non-compliant by no payment                        5274
non-compliant by late payment more than 1 month     939
compliant by late payment within 1 month            554
compliant by on-time payment                        374
compliant by early payment                           10
compliant by payment with no scheduled hearing        7
compliant by payment on unknown date                  3
Name: compliance_detail, dtype: int64
Does somebody know what's happening here? Why is a string seen in categorical data? Why is a dtype of int64 listed for this column?
Thank you for your time.
Upvotes: 1
Views: 464
Reputation: 4840
I should have read the docs more carefully ;-) Most statistical tests in sklearn do not handle categories the way they do in R. In theory, a RandomForestClassifier can handle categories without problems, but the implementation in sklearn does not allow it (for now). My mistake was to think that it could, because theory says so and it worked nicely in R. However, the sklearn documentation says the following about the fit function:
X : array-like or sparse matrix of shape = [n_samples, n_features]
The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.
Thus there is no room for categories, and when they are factorized they are treated as numbers. In this article it is explained how categories work in Pandas and what their pitfalls are. I advise everybody who wants to use categories to read it, especially those with an R background. I hope this aspect will be improved, as in the current situation one cannot make full use of some procedures.
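The float conversion quoted above can be reproduced without sklearn. Here is a minimal sketch, assuming a small made-up frame: only the column name compliance_detail and its labels come from the question, while the amount column and all values are illustrative. Casting a category column to float32 fails with a ValueError like the one in the question, while its integer codes or one-hot columns cast cleanly.

```python
import numpy as np
import pandas as pd

# Illustrative data; only the column name and labels come from the question.
X = pd.DataFrame({
    "amount": [250.0, 100.0, 305.0],
    "compliance_detail": [
        "non-compliant by no payment",
        "compliant by on-time payment",
        "non-compliant by no payment",
    ],
})
X["compliance_detail"] = X["compliance_detail"].astype("category")

# Roughly what a float32 conversion runs into: the labels are still strings.
try:
    np.asarray(X["compliance_detail"], dtype=np.float32)
    cast_failed = False
except ValueError:
    cast_failed = True

# Option A: integer codes (this implies an ordering between the labels).
X_codes = X.copy()
X_codes["compliance_detail"] = X_codes["compliance_detail"].cat.codes

# Option B: one-hot columns (no implied ordering, but more columns).
X_onehot = pd.get_dummies(X, columns=["compliance_detail"])

# Both encoded frames convert to float32 without error.
codes_as_float = X_codes.to_numpy(dtype=np.float32)
onehot_as_float = X_onehot.to_numpy(dtype=np.float32)
```

Which encoding suits a random forest better depends on the data; integer codes keep the frame narrow, while one-hot columns avoid suggesting that one label is "larger" than another.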
Upvotes: 1
Reputation: 402263
When you convert to category type, the column remains in its original repr, but pandas keeps track of the categories.
s
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
Name: A, dtype: object
s = s.astype('category')
s
0 foo
1 bar
2 foo
3 bar
4 foo
5 bar
6 foo
7 foo
Name: A, dtype: category
Categories (2, object): [bar, foo]
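To see why the strings are still visible, and why value_counts() printed int64 in the question, it can help to look at the two pieces a categorical Series stores. A small sketch, rebuilding the example Series above by hand:

```python
import pandas as pd

s = pd.Series(["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], name="A")
s = s.astype("category")

# The distinct labels, stored once (sorted for string data).
labels = list(s.cat.categories)

# One small integer per row, pointing into the labels.
codes = s.cat.codes.tolist()

# value_counts() returns a Series of counts, so the dtype printed
# under its output describes the counts, not the column itself.
counts = s.value_counts()

print(labels)
print(codes)
print(s.dtype, counts.dtype)
```

The column itself still reports category as its dtype; the integer dtype in the question belongs to the counts.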
If you want integer categories, you've a few options:
Option 1: cat.codes
s.cat.codes
0 1
1 0
2 1
3 0
4 1
5 0
6 1
7 1
dtype: int8
Option 2: pd.factorize (astype not required)
pd.factorize(s)[0]
array([0, 1, 0, 1, 0, 1, 0, 0])
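Note that the two options number the labels differently, as the outputs above show. A short sketch of the difference, again rebuilding the example Series by hand, along with how either encoding can be mapped back to the labels:

```python
import pandas as pd

s = pd.Series(["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], name="A")

# factorize numbers labels in order of first appearance: foo -> 0, bar -> 1.
codes, uniques = pd.factorize(s)

# cat.codes numbers labels by sorted category: bar -> 0, foo -> 1.
cat = s.astype("category")
cat_codes = cat.cat.codes.to_numpy()

# With sort=True, factorize uses the sorted order as well.
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)

# Either encoding can be mapped back to the original labels.
from_factorize = list(uniques.take(codes))
from_cat_codes = list(cat.cat.categories.take(cat_codes))
```

Keeping uniques (or the categories) around matters if the same mapping has to be applied to new data later.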
Upvotes: 1