Liatz

Reputation: 5207

Decision tree implementation for returning the next feature to split the tree

Suppose my data consists of fruits, described by their color, shape, and more features. I would like to return at most X fruits that have the features the user stated, and I would like to do it in the minimum number of questions.

The first questions I always ask the user are the color and shape of the fruit. Based on the user's answers, I would then like to ask about K more features (texture, size, peel type, etc.). I want K to be the smallest number that will return the most accurate X results, so I need to know which feature I should ask the user about next. My DB consists of fruits classified by features (arbitrary values).

Is this a machine learning problem? What algorithm should I use, and which implementation? I've looked in scikit-learn, nltk, and weka for a suitable algorithm. Either those algorithms are not suitable for this problem, or I need more specific guidance on using them.

Thank you!

Upvotes: 3

Views: 1828

Answers (2)

Stephen

Reputation: 2424

Yes, it is.

A decision tree projects the points onto each feature and finds the best split. The split can be chosen by different metrics, for example the Gini index or entropy (information gain). Scikit-learn has this in sklearn.tree.

Suppose you have 5 data points:

 color   shape   fruit
 orange  oblong  orange
 red     round   apple
 orange  round   orange
 red     oblong  apple
 red     round   apple

So to train you would do something like this:

feature   class  |  feature  class
orange    orange |  oblong   orange
red       apple  |  round    apple
orange    orange |  round    orange
red       apple  |  oblong   apple
red       apple  |  round    apple

As you can see, the best split is color, because for this dataset if color = red then fruit = apple, and if color = orange then fruit = orange.
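You can verify this numerically with a quick information-gain calculation. The sketch below (my own helper names, not part of any library) computes the entropy of each candidate split on the 5-point dataset above; color should come out with a much higher gain than shape:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Parent entropy minus the weighted entropy of the child groups."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature_values, labels):
        groups.setdefault(f, []).append(y)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

colors = ["orange", "red", "orange", "red", "red"]
shapes = ["oblong", "round", "round", "oblong", "round"]
fruits = ["orange", "apple", "orange", "apple", "apple"]

print(info_gain(colors, fruits))  # ~0.971: splitting on color gives pure leaves
print(info_gain(shapes, fruits))  # ~0.020: splitting on shape barely helps
```

Since the color split leaves both child groups pure, its information gain equals the full parent entropy, which is why the tree asks about color first.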

Training on these data points you would have the decision tree:

        color
___________________
|                 |
|                 |
red               orange
apple             orange

In real life these splits would often be based on numerical thresholds, e.g. num > 0.52.

As for which algorithm to use, it depends. You'll have to experiment with your own data, because the best choice varies per dataset and per preference.

You could use sci-kit learn on the example above like this:

from sklearn.tree import DecisionTreeClassifier
#make your sample matrix (columns: color, shape)
samples = [[1,1], [0,0], [1,0], [0,1], [0,0]]
#make your target vector (in this case fruit)
fruitname = [1, 0, 1, 0, 0]
#create and fit the model
dtree = DecisionTreeClassifier()
dtree = dtree.fit(samples, fruitname)
#test an unknown red fruit that is oblong (predict expects a 2D array)
dtree.predict([[0,1]])

Note that color=1 means that the fruit is orange, and shape=1 means that the fruit is oblong.

Have a look through the scikit-learn user guide for a more in-depth overview.

Upvotes: 1

dmg

Reputation: 7716

Yes, this is a machine learning problem (in a way). I would suggest the decision tree approach, for which there are many different algorithms. ID3 and C4.5 are simple algorithms that help minimize the depth of the tree, since they choose the next question (the next feature to split on) by maximum information gain.
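To make the greedy idea concrete, here is a minimal ID3-style sketch (my own function names, not a library API): at each node it picks the feature with maximum information gain, splits on it, and recurses until the node is pure or no features remain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, features):
    """Greedy ID3 sketch: rows are dicts mapping feature name -> value."""
    if len(set(labels)) == 1:            # pure node: predict the single class
        return labels[0]
    if not features:                     # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        return entropy(labels) - remainder

    best = max(features, key=gain)       # the "next question to ask"
    tree = {}
    for value in {row[best] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        tree[value] = id3([r for r, _ in sub],
                          [y for _, y in sub],
                          [f for f in features if f != best])
    return {best: tree}

rows = [{"color": "orange", "shape": "oblong"},
        {"color": "red",    "shape": "round"},
        {"color": "orange", "shape": "round"},
        {"color": "red",    "shape": "oblong"},
        {"color": "red",    "shape": "round"}]
fruits = ["orange", "apple", "orange", "apple", "apple"]

print(id3(rows, fruits, ["color", "shape"]))
```

On Stephen's 5-fruit dataset this produces a one-level tree keyed on color, which matches the intuition that color alone separates apples from oranges here.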

Upvotes: 0
