Liatz

Reputation: 5207

Decision tree implementation for returning the next feature to split the tree

Suppose my data consists of fruits, described by their color, shape, and more features. I would like to return at most X fruits that have the features the user stated, and I would like to do it in the minimum number of questions.

The first questions I always ask the user are the color and shape of the fruit. Based on the user's answers, I would then like to ask about K more features (texture, size, peel type, etc.). I want K to be the smallest number that will return the most accurate X results, so I need to know which feature I should ask the user about next. My DB consists of fruits classified by features (arbitrary values).

Is this a machine learning problem? What algorithm should I use, and which implementation? I've looked in scikit-learn, nltk, and weka for a suitable algorithm. Either those algorithms are not suitable for this problem, or I need more specific guidance on using them.

Thank you!

Upvotes: 3

Views: 1828

Answers (2)

Stephen

Reputation: 2424

Yes, it is.

A decision tree projects the points onto each feature and finds the best split. The split can be chosen by different metrics, for example the Gini index or entropy (information gain). Scikit-learn has this in sklearn.tree.

Suppose you have 5 data points:

 color   shape   fruit
 orange  oblong  orange
 red     round   apple
 orange  round   orange
 red     oblong  apple
 red     round   apple

So to train you would do something like this:

feature   class  |  feature  class
orange    orange |  oblong   orange
red       apple  |  round    apple
orange    orange |  round    orange
red       apple  |  oblong   apple
red       apple  |  round    apple

As you can see, the best split is color, because for this dataset if color = red then fruit = apple, and if color = orange then fruit = orange.
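You can verify this numerically with a quick information-gain calculation. The sketch below (my own helper names, not part of any library) computes the entropy of each candidate split on the 5-point dataset above; color should come out with a much higher gain than shape:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Parent entropy minus the weighted entropy of the child groups."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature_values, labels):
        groups.setdefault(f, []).append(y)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

colors = ["orange", "red", "orange", "red", "red"]
shapes = ["oblong", "round", "round", "oblong", "round"]
fruits = ["orange", "apple", "orange", "apple", "apple"]

print(info_gain(colors, fruits))  # ~0.971: splitting on color gives pure leaves
print(info_gain(shapes, fruits))  # ~0.020: splitting on shape barely helps
```

Since the color split leaves both child groups pure, its information gain equals the full parent entropy, which is why the tree asks about color first.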

Training on these data points you would have the decision tree:

        color
___________________
|                 |
|                 |
red               orange
apple             orange

In real life these splits would often be based on numerical thresholds, e.g. num > 0.52.

As for which algorithm to use, it depends. You'll have to experiment with your own data, because the best choice varies per dataset and per preference.

You could use sci-kit learn on the example above like this:

from sklearn.tree import DecisionTreeClassifier
#make your sample matrix (columns: color, shape)
samples = [[1,1], [0,0], [1,0], [0,1], [0,0]]
#make your target vector (in this case fruit)
fruitname = [1, 0, 1, 0, 0]
#create and fit the model
dtree = DecisionTreeClassifier()
dtree = dtree.fit(samples, fruitname)
#test an unknown red fruit that is oblong (predict expects a 2D array)
dtree.predict([[0,1]])

Note that color=1 means that the fruit is orange, and shape=1 means that the fruit is oblong.

Have a look through the scikit-learn user guide for a more in-depth overview.

Upvotes: 1

dmg

Reputation: 7716

Yes, this is a machine learning problem (in a way). I would suggest the decision tree approach, for which there are many different algorithms. ID3 and C4.5 are simple algorithms that help minimize the depth of the tree, since they choose the next question (the next feature to split on) by maximum information gain.
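To make the greedy idea concrete, here is a minimal ID3-style sketch (my own function names, not a library API): at each node it picks the feature with maximum information gain, splits on it, and recurses until the node is pure or no features remain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, features):
    """Greedy ID3 sketch: rows are dicts mapping feature name -> value."""
    if len(set(labels)) == 1:            # pure node: predict the single class
        return labels[0]
    if not features:                     # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        return entropy(labels) - remainder

    best = max(features, key=gain)       # the "next question to ask"
    tree = {}
    for value in {row[best] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        tree[value] = id3([r for r, _ in sub],
                          [y for _, y in sub],
                          [f for f in features if f != best])
    return {best: tree}

rows = [{"color": "orange", "shape": "oblong"},
        {"color": "red",    "shape": "round"},
        {"color": "orange", "shape": "round"},
        {"color": "red",    "shape": "oblong"},
        {"color": "red",    "shape": "round"}]
fruits = ["orange", "apple", "orange", "apple", "apple"]

print(id3(rows, fruits, ["color", "shape"]))
```

On Stephen's 5-fruit dataset this produces a one-level tree keyed on color, which matches the intuition that color alone separates apples from oranges here.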

Upvotes: 0
