Reputation: 414
I'm trying to implement a DecisionTreeClassifier from sklearn with a pandas DataFrame, but it returns some weird values when splitting my data. My dataset contains 3 columns with Pearson's correlation coefficients, which can only lie between -1.0 and 1.0. The root node, however, already splits on one of these columns at Pearsons <= 1.0 and shows two child nodes for True and False. But that's impossible! All the values are <= 1.0, so no split could have been made there. Does anyone have any idea what is going on here?
In my code I tried both the Gini and entropy criteria, both splitters, and various other combinations of the possible parameters. Here is more or less my code now, but I'm still playing around with the parameters:
import graphviz
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Select the feature columns and the target column
newtable = table_of_pickle_ptptnew.loc[:, ('Pearsons Ratio', 'Pearsons 330nm', 'Pearsons 350nm', 'Ratio Space', '330nm Similarity', '350nm Similarity')]
x = newtable.values
y = table_of_pickle_ptptnew['Binding Known'].values

dtree = DecisionTreeClassifier(max_features='auto',
                               max_depth=3,
                               criterion='entropy',
                               min_impurity_decrease=0.09)
fittree = dtree.fit(x, y.astype('str'))

# Render the fitted tree with graphviz
dot_data = tree.export_graphviz(fittree, out_file=None,
                                class_names=['No Interaction', 'Interaction'],
                                feature_names=['Pearsons Ratio', 'Pearsons 330nm', 'Pearsons 350nm', 'Ratio Space', '330nm Similarity', '350nm Similarity'],
                                filled=True)
graph = graphviz.Source(dot_data)
graph
        Pearsons Ratio  Pearsons 330nm  Pearsons 350nm  Ratio Space  330nm Similarity  350nm Similarity
Elem a  0.94856         0.99999         0.99999         0.000725507  0.157209          0.0572688
Elem b  0.99234         1               0.99999         0.00657003   0.0568281         0.0465139
Elem c  0.98525         0.99999         0.99999         0.0114932    0.0226809         0.133452
Elem d  0.99793         0.99999         0.99999         0.000643209  0.154585          0.0914759
Elem e  0.99849         0.99999         0.99999         0.00128532   0.0932893         0.0464462
Here is what the first nodes of the tree look like. What I mean is that the False child of the root node's condition (Pearsons 350nm <= 1.0) should be impossible, since all samples satisfy the condition (True).
Upvotes: 1
Views: 503
Reputation: 414
OK, I found out what the problem was. The graphviz visualization of the tree limits the number of decimal places and rounds the thresholds if they are too long. I used an algorithm to generate the pseudo-code for my decision tree automatically, and in that output the 'true values' showed up: the 1.0 at the root node of the graphviz tree is actually '0.9999749660491943'.
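You can also check this directly, since the unrounded thresholds are stored on the fitted estimator itself in `tree_.threshold`. A minimal sketch on synthetic data (toy values crowded just below 1.0, not the original dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Feature values crowded just below 1.0, like Pearson coefficients
x = rng.uniform(0.99, 1.0, size=(200, 1))
y = (x[:, 0] > 0.995).astype(int)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x, y)

# tree_.threshold holds the full-precision split values;
# leaf nodes are marked with -2.0
print(clf.tree_.threshold)
```

Printed at full precision, a threshold that graphviz would display as 1.0 shows up as something like 0.99xx... instead.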
I think this is important to know for everyone working with scientific numbers that have many digits. :) If you work with numbers like this, remember to extract the decision code from your tree and don't rely only on the pretty colorful tree.
Thank you to everyone who took a bit of their time to try to help me with my issue. :)
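If you still want the graphviz picture, both of scikit-learn's exporters let you raise the number of digits shown: `export_graphviz` takes a `precision=` parameter and `export_text` takes `decimals=`. A sketch on a toy tree (iris data, not the original dataset):

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
fittree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_graphviz: precision= controls the number of digits rendered
dot_data = tree.export_graphviz(fittree, out_file=None, precision=10)

# export_text: decimals= controls rounding in the printed rules
print(tree.export_text(fittree, decimals=10))
```

With precision raised, the graph and the text rules both show the same full thresholds that `tree_.threshold` stores.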
Upvotes: 2