Reputation: 77
I am working on a multi-label classification project with scikit-learn. I want to binarize the target feature, but I am running into difficulties during the data transformation.
Here is the raw data:
107 RA37|RA41|RM153 |RWT037
108 DA35|DA47|DWT030|DA35|DA47|DWT030
109 NaN
110 PI001 |PI040
111 PI001 |PI040
112 RA37|RA41|RWT037
113 DA35|DA47|DWT030|DA35|DA47|DWT030
114 NaN
Name: exclusions, dtype: object
Then I split it into multiple columns with str.split('|', expand=True)
and got the following output:
0 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25 26 27
107 RA37 RA41 RM153 RWT037 None None None None None None ... None None None None None None None None None None
108 DA35 DA47 DWT030 DA35 DA47 DWT030 None None None None ... None None None None None None None None None None
109 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
110 PI001 PI040 None None None None None None None None ... None None None None None None None None None None
111 PI001 PI040 None None None None None None None None ... None None None None None None None None None None
112 RA37 RA41 RWT037 None None None None None None None ... None None None None None None None None None None
113 DA35 DA47 DWT030 DA35 DA47 DWT030 None None None None ... None None None None None None None None None None
114 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, because the column already contained many NaN values before processing, the result is a mix of NaN and None. That means I cannot directly use MultiLabelBinarizer on these mixed data types. How do I fix this problem? Thanks in advance!
Upvotes: 1
Views: 1763
Reputation: 36609
Assuming the following list to be your multi-label targets:
107 RA37|RA41|RM153 |RWT037
108 DA35|DA47|DWT030|DA35|DA47|DWT030
109 NaN
110 PI001 |PI040
111 PI001 |PI040
112 RA37|RA41|RWT037
113 DA35|DA47|DWT030|DA35|DA47|DWT030
114 NaN
Part 1: Handling the NaNs:
There are multiple ways to handle the NaNs:
1) NaN as a target doesn't make sense. If you don't know what the target is, how will you train the model on it, and how will you compare it to the output?
So the solution here is to remove the complete samples (rows) that have NaNs in them. The resulting targets will look like this:
107 RA37|RA41|RM153 |RWT037
108 DA35|DA47|DWT030|DA35|DA47|DWT030
110 PI001 |PI040
111 PI001 |PI040
112 RA37|RA41|RWT037
113 DA35|DA47|DWT030|DA35|DA47|DWT030
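A minimal sketch of option 1 with pandas, assuming the targets live in a Series like the one in the question (the Series below is a hypothetical reconstruction of that data, not the asker's actual frame):

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the `exclusions` column from the question.
exclusions = pd.Series(
    [
        "RA37|RA41|RM153 |RWT037",
        "DA35|DA47|DWT030|DA35|DA47|DWT030",
        np.nan,
        "PI001 |PI040",
        "PI001 |PI040",
        "RA37|RA41|RWT037",
        "DA35|DA47|DWT030|DA35|DA47|DWT030",
        np.nan,
    ],
    index=range(107, 115),
    name="exclusions",
)

# Option 1: drop every row whose target is missing.
kept = exclusions.dropna()

# The surviving index tells you which feature rows to keep as well.
print(kept.index.tolist())
```

Remember to drop the same rows from the feature matrix, for example by selecting on kept.index, so that features and targets stay aligned.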
2) Replace the NaN with a new label, something like Unknown or Unclassified.
107 RA37|RA41|RM153 |RWT037
108 DA35|DA47|DWT030|DA35|DA47|DWT030
109 UNKNOWN
110 PI001 |PI040
111 PI001 |PI040
112 RA37|RA41|RWT037
113 DA35|DA47|DWT030|DA35|DA47|DWT030
114 UNKNOWN
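Option 2 could be sketched with pandas as follows. The Series is again a hypothetical reconstruction of the question's data, and "UNKNOWN" is just a placeholder name. Filling before splitting, and splitting into plain Python lists rather than with expand=True, avoids the NaN/None mix from the question. The strip() call is an assumption that the stray spaces in the raw data (such as 'RM153 ') are not meaningful:

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the `exclusions` column from the question.
exclusions = pd.Series(
    [
        "RA37|RA41|RM153 |RWT037",
        "DA35|DA47|DWT030|DA35|DA47|DWT030",
        np.nan,
        "PI001 |PI040",
        "PI001 |PI040",
        "RA37|RA41|RWT037",
        "DA35|DA47|DWT030|DA35|DA47|DWT030",
        np.nan,
    ],
    index=range(107, 115),
    name="exclusions",
)

# Option 2: replace missing targets with a placeholder label.
filled = exclusions.fillna("UNKNOWN")

# Split each string on '|' into a list of labels, stripping stray spaces
# so that 'RM153 ' and 'RM153' are not treated as different labels.
y = [[label.strip() for label in row.split("|")] for row in filled]

print(y[0])
print(y[2])
```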
Part 2: Using MultiLabelBinarizer:
With either of the above solutions, you will get a list of targets something like this (shown here for option 2, with stray spaces removed):
y = ['RA37|RA41|RM153|RWT037', 'DA35|DA47|DWT030|DA35|DA47|DWT030', 'UNKNOWN', 'PI001|PI040', 'PI001|PI040', 'RA37|RA41|RWT037', 'DA35|DA47|DWT030|DA35|DA47|DWT030', 'UNKNOWN']
But MultiLabelBinarizer expects a list of lists, so we need to split the above strings as you were doing:
y = [y_val.split('|') for y_val in y]
Now y is in the correct format. Now use the MLB:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_encoded = mlb.fit_transform(y)
# Output:
array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 1, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
This can be used as y in a model of your choice (which should support the indicator matrix format above).
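To see which column of the indicator matrix corresponds to which label, and to turn predictions back into labels, the fitted binarizer's classes_ attribute and inverse_transform method can help. A small sketch, using a shortened version of the y list above as illustrative input:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, illustrative version of the split targets.
y = [
    ["RA37", "RA41", "RM153", "RWT037"],
    ["DA35", "DA47", "DWT030", "DA35", "DA47", "DWT030"],
    ["UNKNOWN"],
    ["PI001", "PI040"],
]

mlb = MultiLabelBinarizer()
y_encoded = mlb.fit_transform(y)

# Column order of the indicator matrix (labels are stored sorted).
labels = list(mlb.classes_)
print(labels)

# Map indicator rows back to tuples of labels.
# Duplicate labels within a row were collapsed during encoding.
decoded = mlb.inverse_transform(y_encoded)
print(decoded[1])
```

The same inverse_transform call can be applied to a model's predicted indicator matrix, provided it has the same columns as y_encoded.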
Upvotes: 2