Reputation: 129
I am trying to use sklearn
to train a decision tree based on my dataset.
When I was trying to slicing the data to (outcome:Y, and predicting variables:X), it turns out that the outcome (my label) is in True
/False
:
#data slicing
X = df.values[:,3:27] #X are the sets of predicting variable, dropping unique_id and student name here
Y = df.values[:,'OffTask'] #Y is our predicted value (outcome), it is in the 3rd column
This is how I do, but I do not know whether this is the right approach:
#convert the label "OffTask" to dummy
df1 = pd.get_dummies(df,columns=["OffTask"])
df1
My trouble is the dataset df1 return my label Offtask
to OffTask_N
and OffTask_Y
Can someone know how to fix it?
Upvotes: 4
Views: 1793
Reputation: 307
get_dummies is used for converting nominal string values to integer. It returns as many as column as many unique string values are available in columns eg:
df={'color':['red','green','blue'],'price':[1200,3000,2500]}
my_df=pd.DataFrame(df)
pd.get_dummies(my_df)
In your case you can drop first value, wherever value is null can be considered it will be first value
Upvotes: 1
Reputation: 16966
You could make the pd.get_dummies
to return only one column by setting drop_first=True
y = pd.get_dummies(df,columns=["OffTask"], drop_first=True)
But this is not the recommended way to convert the label to binaries. I would suggest using labelbinarizer for this purpose.
Example:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit_transform(pd.DataFrame({'OffTask':['yes', 'no', 'no', 'yes']}))
#
array([[1],
[0],
[0],
[1]])
Upvotes: 0