Reputation: 805
I am pretty new to scikit-learn and right now I am struggling with the preprocessing stage.
I have the following categorical features (I parsed a JSON file and placed it in a dictionary):
dct['alcohol'] = ["Binge drinking",
"Heavy drinking",
"Moderate consumption",
"Low consumption",
"No consumption"]
dct['tobacco']= ["Current daily smoker - heavy",
"Current daily smoker",
"Current on-and-off smoker",
"Former Smoker",
"Never Smoked",
"Snuff User"]
dct['onset'] = ["Gradual",
"Sudden"]
My first approach is to convert the values to integers with LabelEncoder and then apply one-hot encoding:
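import numpy as np
import sklearn.preprocessing
# n_values was removed in scikit-learn 0.22; categories lists the allowed integer codes per feature instead
OH_enc = sklearn.preprocessing.OneHotEncoder(categories=[np.arange(len(dct['alcohol'])), np.arange(len(dct['tobacco'])), np.arange(len(dct['onset']))])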
le_alc = sklearn.preprocessing.LabelEncoder()
le_tobacco = sklearn.preprocessing.LabelEncoder()
le_onset = sklearn.preprocessing.LabelEncoder()
le_alc.fit(dct['alcohol'])
le_tobacco.fit(dct['tobacco'])
le_onset.fit(dct['onset'])
list_patient = []
list_patient.append(list(le_alc.transform(['Low consumption'])))
list_patient.append(list(le_tobacco.transform(['Former Smoker'])))
list_patient.append(list(le_onset.transform(['Sudden'])))
list1 = []
list1.append(np.array(list_patient).T[0][:])
list1.append([1,2,0])
OH_enc.fit(list1)
print(OH_enc.transform([[4,2,0]]).toarray())
So if you one-hot encode (4,2,0) you get:
[[0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]]
This is what I want, since the first 5 columns refer to the "alcohol" feature, the next 6 columns refer to "tobacco", and the last 2 columns refer to the "onset" feature.
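For reference, the integer codes used above depend on how LabelEncoder orders its classes: it sorts the labels, so the codes do not follow the order in which the values appear in dct. A minimal sketch of how to inspect that mapping for the alcohol feature (the names alcohol_classes, code_to_label and low_code are only illustrative):

```python
from sklearn.preprocessing import LabelEncoder

alcohol = ["Binge drinking", "Heavy drinking", "Moderate consumption",
           "Low consumption", "No consumption"]

le_alc = LabelEncoder()
le_alc.fit(alcohol)

# classes_ is sorted, so "Low consumption" comes before "Moderate consumption"
alcohol_classes = list(le_alc.classes_)

# map each integer code back to the label it stands for
code_to_label = {code: str(label) for code, label in enumerate(alcohol_classes)}

# integer code that transform() assigns to one label
low_code = int(le_alc.transform(["Low consumption"])[0])

print(code_to_label)
print(low_code)
```

With this ordering, code 4 in the alcohol block stands for "No consumption".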
However, let's assume that one example could have more than one value in one feature. Let's say one example gets both "Binge drinking" and "Heavy drinking" from the alcohol feature. Then, if you one-hot encode ([0,1],2,0), the result should be:
[[1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.]]
I do not know how to code this last step with sklearn.preprocessing.OneHotEncoder. That is, how can I encode 2 values in one feature for a single example?
I know that there might be a better way to encode "alcohol", "tobacco", and "onset", because they are ordinal values (each value in a feature is related to the other values in the same feature), so I could just label them and then normalize them. But let's assume they are categorical variables whose values are independent of each other.
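As a rough sketch of the "label them and then normalize" alternative mentioned above, OrdinalEncoder can take an explicit category order and MinMaxScaler can rescale the resulting ranks to [0, 1]. The severity ranking in alcohol_order is an assumption made for illustration, not something given in the data, and only the alcohol feature is shown:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Assumed ranking from lowest to highest intake (illustrative only)
alcohol_order = ["No consumption", "Low consumption", "Moderate consumption",
                 "Heavy drinking", "Binge drinking"]

# One column (the alcohol feature), three example patients
X = np.array([["Low consumption"],
              ["Binge drinking"],
              ["No consumption"]], dtype=object)

# categories fixes the rank of each label instead of relying on sorted order
encoder = OrdinalEncoder(categories=[alcohol_order])
codes = encoder.fit_transform(X)

# rescale the ranks so the column lies in [0, 1]
scaled = MinMaxScaler().fit_transform(codes)

print(codes.ravel())
print(scaled.ravel())
```

This only makes sense when each example has exactly one value per feature, so it does not address the multi-value case asked about here.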
Upvotes: 0
Views: 1224
Reputation: 805
I finally solved it using MultiLabelBinarizer, as @VivekKumar suggested:
from sklearn.preprocessing import MultiLabelBinarizer
headings = dct['alcohol'] + dct['tobacco'] + dct['onset']
#print('my headings:'+ str(headings))
l1 = ['Heavy drinking, Low consumption, Former Smoker, Gradual', 'Low consumption, No consumption, Current on-and-off smoker, Sudden', 'Heavy drinking, Current on-and-off smoker']
mlb = MultiLabelBinarizer() # pass sparse_output=True if you'd like
# split each comma-separated string into its list of labels before binarizing
dataMatrix = mlb.fit_transform(row.split(', ') for row in l1)
print("My Classes: ")
print(mlb.classes_)
print(dataMatrix)
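One thing to be aware of: by default MultiLabelBinarizer orders its columns by sorted label and only creates columns for labels it has seen during fit, so the columns are not grouped by feature. If the layout from the question is wanted (alcohol columns first, then tobacco, then onset), one option is to pass the full label list through the classes parameter. A minimal sketch, with the three lists retyped from the question and patients as a made-up example row:

```python
from sklearn.preprocessing import MultiLabelBinarizer

alcohol = ["Binge drinking", "Heavy drinking", "Moderate consumption",
           "Low consumption", "No consumption"]
tobacco = ["Current daily smoker - heavy", "Current daily smoker",
           "Current on-and-off smoker", "Former Smoker", "Never Smoked",
           "Snuff User"]
onset = ["Gradual", "Sudden"]
headings = alcohol + tobacco + onset

# One patient with two alcohol values, one tobacco value and one onset value
patients = [["Binge drinking", "Heavy drinking",
             "Current on-and-off smoker", "Gradual"]]

# classes fixes both the set of columns and their order
mlb = MultiLabelBinarizer(classes=headings)
multi_hot = mlb.fit_transform(patients)

print(list(mlb.classes_))
print(multi_hot)
```

Here the first 5 columns belong to alcohol, the next 6 to tobacco and the last 2 to onset, and labels that never occur in the data still get an all-zero column.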
Upvotes: 2