Reputation: 21
I am using scikit-learn decision trees for a classification problem. My input data has a precision of 4 decimal places. However, due to binary representation errors, the internal NumPy representation may carry more than 4 decimal places of significance.
Is there a way to instruct the sklearn algorithm not to use threshold values with more than 4 decimal places when building the binary tree? Otherwise I'm afraid the results could be meaningless at large depths.
Upvotes: 2
Views: 1503
Reputation: 13743
A possible way to avoid the numeric errors associated with floating-point representation when building a decision tree is to fit the model on integers rather than floats. If your input data has a precision of 4 decimal places, you just need to multiply it by 10^4, round to the nearest integer, and cast the result to an integer type, like this:
input_data = np.int32(np.around(input_data * 10**4))
With the features rescaled this way, the split thresholds are computed on exact integer values rather than inexact floats:
In [2]: import numpy as np
In [3]: input_data = np.array([0.0020, 17.0001, 531.4679])
In [4]: np.set_printoptions(precision=32)
In [5]: input_data
Out[5]:
array([  2.00000000000000004163336342344337e-03,
         1.70000999999999997669419826706871e+01,
         5.31467899999999985993781592696905e+02])
In [6]: input_data = np.int32(np.around(input_data * 10**4))
In [7]: input_data
Out[7]: array([ 20, 170001, 5314679])
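As a quick sanity check (a minimal sketch with made-up toy data, not part of the original answer), you can fit a DecisionTreeClassifier on the scaled integers and read the learned split points back from tree_.threshold; internal-node thresholds land on exact .5 midpoints between integer values:

# Minimal sketch: fit a tree on the scaled integers and inspect its
# thresholds. X and y are hypothetical toy data, not from the answer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0020], [17.0001], [531.4679], [0.0040]])
y = np.array([0, 1, 1, 0])

X_scaled = np.int32(np.around(X * 10**4))  # 0.0020 -> 20, etc.

clf = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Leaf nodes store the sentinel threshold -2; keep internal nodes only.
# Dividing by 10**4 maps the thresholds back to the original units.
thresholds = clf.tree_.threshold
print(thresholds[thresholds != -2] / 10**4)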
Upvotes: 1