Reputation: 129
I am trying to encode the target column of my DataFrame. The variable in this column has type object.
I have a DataFrame, icd10, that contains all the codes. Using those, I am trying to binarize the labels of my infoDF DataFrame.
My code looks like this:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
#fit all the possible label codes
lb.fit(icd10['ICD10'])
temp = lb.transform(infoDF['Target'])
for i, x in enumerate(lb.classes_):
    infoDF[x] = temp[:, i]
When I run it, I get the following traceback:
File "<ipython-input-42-2b1db450b16e>", line 3, in <module>
lb.fit(icd10['ICD10'])
File "C:\Users\as\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 413, in fit
self.classes_ = unique_labels(y)
File "C:\Users\as\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 21
21 22
22 A00
23 A000
24 A001
25 A009
26 A00–A09
27 A01
28 A010
29 A011
19433 Z960
19434 Z961
19435 Z962
19436 Z963
19437 Z964
19438 Z965
19439 Z966
19440 Z967
19441 Z968
19442 Z969
19443 Z97
19444 Z970
19445 Z971
19446 Z972
19447 Z973
19448 Z974
19449 Z975
19450 Z978
19451 Z98
19452 Z980
19453 Z981
19454 Z982
19455 Z988
19456 Z99
19457 Z990
19458 Z991
19459 Z992
19460 Z993
19461 Z998
19462 Z999
Name: ICD10, Length: 19463, dtype: object,)
I am not sure what I am doing wrong.
Upvotes: 3
Views: 1178
Reputation: 31739
Although we don't have the exact format of your data set, it looks like the initial integers cause the problem.
sklearn's LabelBinarizer calls sklearn.utils.multiclass.unique_labels, which according to the documentation does not allow "a mix of string and integer labels".
Try removing the first 22 rows (index 0 through 21, which hold the integers 1 to 22) and see if the error persists.
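import pandas as pd
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
icd11 = pd.DataFrame({'ICD11': [0, '6C51', '6C50.Z']})

# crashes: the column mixes an integer with strings
try:
    lb.fit(icd11['ICD11'])
except ValueError as err:
    print(err)

# does not crash: the remaining labels are all strings
lb.fit(icd11['ICD11'][1:])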
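If dropping rows is not an option, another approach that may work is to cast every label to a string before fitting. The sketch below is only an illustration under assumptions: the small icd10 and infoDF frames are made-up stand-ins for your data, and it assumes the integer entries can sensibly be treated as string labels.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-ins for the real icd10 and infoDF DataFrames
icd10 = pd.DataFrame({'ICD10': [1, 2, 'A00', 'A000', 'Z999']})
infoDF = pd.DataFrame({'Target': ['A00', 'Z999', 'A00']})

# Find the entries that are not strings (here, the leading integers)
not_str = ~icd10['ICD10'].map(lambda v: isinstance(v, str))
print(icd10.loc[not_str, 'ICD10'].tolist())

# Cast all labels to str so the binarizer sees a single label type
codes = icd10['ICD10'].astype(str).tolist()
lb = preprocessing.LabelBinarizer()
lb.fit(codes)

# One row per sample, one column per entry in lb.classes_
encoded = lb.transform(infoDF['Target'].astype(str).tolist())
print(encoded.shape)
```

Note that after the cast the integer 1 becomes the label '1', so it only matches targets that are also the string '1'.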
Upvotes: 4