Reputation: 5207
Suppose my data consist of fruits, described by their color, shape, and other features. I would like to return a maximum of X fruits that match the features the user stated, and I would like to do it in the minimum number of questions.
The first questions I always ask the user are the color and shape of the fruit. Based on the user's answers, I would like to ask for K more features, such as texture, size, peel type, etc. I would like K to be the smallest number that returns the most accurate X results, so I would like to know which feature I should ask the user for next. My DB consists of fruits classified by features (arbitrary values).
Is this a machine learning problem? Which algorithm should I use, and which implementation? I've looked through scikit-learn, NLTK, and Weka for a suitable algorithm. Either those algorithms are not suitable for this problem, or I need more specific guidance on using them.
Thank you!
Upvotes: 3
Views: 1828
Reputation: 2424
Yes, it is.
A decision tree projects the points onto each feature and finds the best split. The split can be chosen by different metrics, for example the Gini index or entropy (information gain). Scikit-learn has this in sklearn.tree.
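To make the split metric concrete, here is a minimal sketch (plain Python, not scikit-learn internals) that computes the weighted Gini impurity of a split on each feature of the five-point dataset below; the feature with the lowest weighted impurity is the best one to split on:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(feature, labels):
    """Average impurity of the children after splitting on one feature."""
    n = len(labels)
    total = 0.0
    for value in set(feature):
        child = [lab for f, lab in zip(feature, labels) if f == value]
        total += len(child) / n * gini(child)
    return total

colors = ["orange", "red", "orange", "red", "red"]
shapes = ["oblong", "round", "round", "oblong", "round"]
fruits = ["orange", "apple", "orange", "apple", "apple"]

print(weighted_gini(colors, fruits))  # 0.0 -> color separates the classes perfectly
print(weighted_gini(shapes, fruits))  # ~0.47 -> shape is a much worse split
```

Since splitting on color leaves both children pure (impurity 0.0), a Gini-based tree picks color as the root question.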
Suppose you have 5 data points:
color   shape   fruit
orange  oblong  orange
red     round   apple
orange  round   orange
red     oblong  apple
red     round   apple
So to train you would do something like this:
color   class   |  shape   class
orange  orange  |  oblong  orange
red     apple   |  round   apple
orange  orange  |  round   orange
red     apple   |  oblong  apple
red     apple   |  round   apple
As you can see, the best split is color: for this dataset, if color = red then fruit = apple, and if color = orange then fruit = orange.
Training on these data points you would have the decision tree:
color
___________________
| |
| |
red orange
apple orange
In real life these splits would often be based on numerical values, e.g. num > 0.52.
As for which algorithm to use, it depends. You'll have to experiment on your own data, because it's more of a per-dataset/preference kind of thing.
You could use sci-kit learn on the example above like this:
from sklearn.tree import DecisionTreeClassifier

# make your sample matrix: each row is [color, shape]
samples = [[1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]

# make your target vector (in this case the fruit)
fruitname = [1, 0, 1, 0, 0]

# create and fit the model
dtree = DecisionTreeClassifier()
dtree = dtree.fit(samples, fruitname)

# test an unknown red fruit that is oblong
# (predict expects a 2-D array: one row per sample)
dtree.predict([[0, 1]])  # array([0]) -> apple
Note that color=1 means that the fruit is orange, and shape=1 means that the fruit is oblong.
Have a look through the sci-kit user guide for a more in-depth overview.
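If you want to see which questions the fitted tree actually asks (which tells you the most informative feature to ask the user first), scikit-learn can print the learned splits with sklearn.tree.export_text. A small sketch using the same toy data as above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# same encoding as above: color 1 = orange / 0 = red, shape 1 = oblong / 0 = round
samples = [[1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]
fruitname = [1, 0, 1, 0, 0]

dtree = DecisionTreeClassifier().fit(samples, fruitname)

# Print the learned splits; the root split is the first question to ask.
tree_text = export_text(dtree, feature_names=["color", "shape"])
print(tree_text)
```

On this dataset the printed tree splits only on color, confirming that shape is never needed once the color is known.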
Upvotes: 1
Reputation: 7716
Yes, this is a machine learning problem (in a way). I would suggest the decision tree approach, for which there are many algorithms. ID3 and C4.5 are simple algorithms that will help you minimize the depth, since they choose the next question (the next feature to split on) by maximum information gain.
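To illustrate the ID3 criterion, here is a minimal sketch (plain Python, using the toy fruit data from the other answer, not an actual ID3 implementation) of information gain, i.e. the drop in entropy after splitting on a feature; ID3 asks about the feature with the highest gain first:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """ID3-style gain: parent entropy minus weighted child entropy."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        child = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(child) / n * entropy(child)
    return entropy(labels) - remainder

colors = ["orange", "red", "orange", "red", "red"]
shapes = ["oblong", "round", "round", "oblong", "round"]
fruits = ["orange", "apple", "orange", "apple", "apple"]

print(information_gain(colors, fruits))  # ~0.971 bits: color resolves the fruit completely
print(information_gain(shapes, fruits))  # ~0.020 bits: shape tells you almost nothing
```

Asking about color first removes all uncertainty in one question, which is exactly the "minimum number of questions" behavior the question is after.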
Upvotes: 0