reMJ

Reputation: 21

How to control the precision of the scikit-learn decision tree algorithm

I am using scikit-learn decision trees for a classification problem. My input data has a precision of 4 decimal places. However, due to binary representation errors, it is possible that the internal numpy representation has more than 4 decimal places of significance.

Is there a way for me to instruct the sklearn algorithm not to use threshold values with more than 4 decimal places when building the binary tree? Otherwise I'm afraid that the results could be meaningless at large depths.

Upvotes: 2

Views: 1503

Answers (1)

Tonechas

Reputation: 13743

A possible way to avoid the numeric errors associated with floating-point representation in the construction of a decision tree is to use integers rather than floats to fit the model. If your input data has a precision of 4 decimal places, you just need to multiply it by 10^4, round to the nearest integer, and cast the result to an integer type, like this:

import numpy as np
input_data = np.int32(np.around(input_data * 10**4))

With this feature scaling, the split thresholds are computed from exact integer feature values, so spurious digits beyond your 4 decimal places cannot influence the splits.

Demo

In [2]: import numpy as np

In [3]: input_data = np.array([0.0020, 17.0001, 531.4679])

In [4]: np.set_printoptions(precision=32)

In [5]: input_data
Out[5]: 
array([  2.00000000000000004163336342344337e-03,
         1.70000999999999997669419826706871e+01,
         5.31467899999999985993781592696905e+02])

In [6]: input_data = np.int32(np.around(input_data * 10**4))

In [7]: input_data
Out[7]: array([     20,  170001, 5314679])
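For completeness, here is a minimal sketch of how the scaled integer features could be used to fit a tree. The toy X and y arrays below are made up purely for illustration; the only assumption is that the same scaling is applied to anything passed to fit() and predict().

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical input with 4 decimal places of precision
X = np.array([[0.0020,  17.0001],
              [0.0030,  17.0002],
              [0.0040, 531.4679],
              [0.0050, 531.4680]])
y = np.array([0, 0, 1, 1])

# Scale to integers so splits are computed on exact values
X_int = np.int32(np.around(X * 10**4))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_int, y)

# Apply the same scaling to new samples before predicting
X_new = np.array([[0.0025, 100.0000]])
print(clf.predict(np.int32(np.around(X_new * 10**4))))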

Upvotes: 1
