AdrienNK

Reputation: 848

Automated building of a decision tree from a function

(Sorry if this question feels a little bit brainstormy)

I have a function F with parameters a_1, a_2..., b which outputs x. The function is also defined by a series of parameters p_1, p_2... that may change during my work.

F(a_1, a_2... , b) = x

Given a_1, a_2..., I'd like to build a decision tree algorithm that finds the b that minimizes x for the function F. I'd like to automate this decision tree builder in order to accommodate the changes of F (through p_1, p_2...).

The automation process is quite important since in practice the a_i can be anything (integers, continuous numbers, discrete parameters) and F is highly non-linear.

One instinctive idea is to generate fake samples and learn a decision tree on that dataset, which would give me the decision tree I need. However, this seems overly complicated since I have access to the function generating this problem.

If someone has any ideas or could point me in a direction that would help me solve my problem, that would be greatly appreciated.

EDIT:

I am changing the scope of my question:

Assume that from the initial problem you obtained a function F' which maps a_1, a_2... to b (b is discrete). Is there an algorithm that tries to "simplify" F' into a decision tree with the a_1, a_2... as nodes?


For example, a decision tree that would say: if a_2 = "type2" and a_1 < 6 -> 3, etc. I am not looking for an exact partitioning; a decent approximation is sufficient.

I was thinking of using a standard ML algorithm for building decision trees on fake samples generated by Monte Carlo simulation of F'. Would that make sense?
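A minimal sketch of that idea, using scikit-learn (my choice of tooling, not something from the question) and a made-up stand-in for F' — your real F' would come from solving the original minimization problem:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for F': a_1 is numeric, a_2 is categorical
# (0 = "type1", 1 = "type2"); it returns a discrete b.
def f_prime(a1, a2):
    if a2 == 1 and a1 < 6:
        return 3
    return 2 if a1 >= 6 else 1

# Monte Carlo sampling of the input space
rng = np.random.default_rng(0)
n = 5000
a1 = rng.uniform(0.0, 12.0, n)
a2 = rng.integers(0, 2, n)
X = np.column_stack([a1, a2])
y = np.array([f_prime(v1, v2) for v1, v2 in zip(a1, a2)])

# Fit a shallow tree as an interpretable approximation of F'
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["a_1", "a_2"]))
```

The printed rules come out in exactly the "if a_2 = ... and a_1 < 6 -> 3" shape you describe; `max_depth` controls how rough the approximation is.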

Upvotes: 0

Views: 151

Answers (1)

igrinis

Reputation: 13676

Your idea makes sense if you want a fast, greedy approximation of F', but you should take care to implement it right:

1) Since your variables may be categorical or numerical, you should think about how to bin them. For highly non-linear functions, the widely used linear (equal-width) binning may not be optimal.

2) Since decision trees have trouble dealing with correlated variables, some preprocessing might help to alleviate the issue. Try starting with PCA.
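To illustrate point 1, here is a small sketch (assuming scikit-learn, and a skewed toy feature) comparing equal-width bins against quantile bins, which often behave better for non-linear responses:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy feature: equal-width bins waste most of their resolution
rng = np.random.default_rng(1)
a1 = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# "uniform" = equal-width (linear) bins; "quantile" = equal-mass bins
uniform = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
quantile = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

u = uniform.fit_transform(a1).ravel().astype(int)
q = quantile.fit_transform(a1).ravel().astype(int)
print("equal-width counts:", np.bincount(u, minlength=5))
print("quantile counts:   ", np.bincount(q, minlength=5))
```

With the skewed feature, almost all samples land in the first equal-width bin, while the quantile bins each hold about a fifth of the data.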

As for using real data versus generating it from the original function, I do not think there should be a significant difference. You might want to augment the initial training dataset here and there if you see that your data is underrepresented in some "areas".
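Regarding the PCA suggestion in point 2 above, a minimal sketch (scikit-learn assumed; applies to the numeric features only) of how the rotation removes the correlation that hurts a tree's axis-aligned splits:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated numeric toy features
rng = np.random.default_rng(2)
a1 = rng.normal(size=1000)
a2 = 0.9 * a1 + 0.1 * rng.normal(size=1000)
X = np.column_stack([a1, a2])
print("before:", np.corrcoef(X, rowvar=False)[0, 1])  # close to 1

# PCA rotates the data so the components are uncorrelated,
# which suits the axis-aligned splits a decision tree makes
Z = PCA(n_components=2).fit_transform(X)
print("after: ", np.corrcoef(Z, rowvar=False)[0, 1])  # close to 0
```

The trade-off is interpretability: after PCA the tree splits on linear combinations of the original a_i rather than the a_i themselves.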

Upvotes: 1
