mrquestion

Reputation: 195

How can I create a decision tree (ID3)?

Based on this article: http://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm I would like to build a decision tree to classify my test document.

TRAIN SET
documents in class1:
text in document1: Chinese Beijing Chinese
text in document2: Chinese Chinese Shanghai
text in document3: Chinese Macao
documents in class2:
text in document4: Tokyo Japan Chinese

TEST SET (I must decide which class the document below belongs to)
text in document5: Chinese Chinese Chinese Tokyo Japan Pekin


So first, as I understand it, I must compute Entropy(S):
Entropy(S) = - (|documents in class1| / |documents in all classes|) * log2(|documents in class1| / |documents in all classes|) - (|documents in class2| / |documents in all classes|) * log2(|documents in class2| / |documents in all classes|) = - (3/4)log2(3/4) - (1/4)log2(1/4) = - 0.22 - (- 0.35) = 0.13
Is that right?

Based on the article, I should now compute Entropy(Sweak) and Entropy(Sstrong), but what should I compute in my case? I have documents, words and classes.

Upvotes: 0

Views: 859

Answers (1)

Oswald

Reputation: 31685

A decision tree tells you in which order to look at the features of an item to classify that item.

You can look at the features in any order you like until you have reached a decision about how to classify an item. This can result in a large and therefore very specific decision tree that fits the examples too closely, which is called overfitting. A small decision tree is preferred, because it is less likely to classify the examples correctly while classifying new items incorrectly.

ID3 is an algorithm for creating small decision trees that classify the examples correctly. It does that by calculating, for each feature, how much the entropy of the classification drops when the examples are split on that feature. This drop, the information gain, tells you how much information you gain by examining that feature, and ID3 examines the feature with the highest gain first. So the first question you have to answer is: what are the features of the items you want to classify?
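To make the entropy and gain calculation concrete, here is a minimal sketch in Python. It assumes one possible choice of features for the documents in the question, namely "does the document contain word X?". That choice is my assumption, not something the article prescribes. The function names `entropy` and `information_gain` are made up for this illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Training documents from the question.
docs = [
    ("Chinese Beijing Chinese", "class1"),
    ("Chinese Chinese Shanghai", "class1"),
    ("Chinese Macao", "class1"),
    ("Tokyo Japan Chinese", "class2"),
]

# Assumed feature encoding: one boolean per word, True if the word occurs in the document.
vocabulary = sorted({word for text, _ in docs for word in text.split()})
rows = [{word: word in text.split() for word in vocabulary} for text, _ in docs]
labels = [label for _, label in docs]

print("Entropy(S) =", round(entropy(labels), 3))
for word in vocabulary:
    print(word, round(information_gain(rows, labels, word), 3))
```

With three documents in class1 and one in class2, Entropy(S) comes out at about 0.811 bits (0.311 + 0.5), not 0.13: both terms are added, since each log2 of a fraction is negative and is negated. Under this encoding, "Chinese" has zero gain because it occurs in every training document, while "Tokyo" and "Japan" separate the two classes completely.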

For example, to classify project offers into accept and reject, you might want to look at past projects and record the following features for each:

  • How long did it take to complete?
  • How much did you get paid (effectively per invested hour)?
  • Which tools and languages were used?
  • How interested are you in the software personally?
  • How challenging was the project?
  • Was the work done at home or on site?
  • Was the communication with the customer satisfactory?

Also record whether you would do the same project again. Use this to classify the projects. Now you can build a decision tree based on the features of your past projects and their classification.
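Building the tree from such records could look roughly like the sketch below. The past projects, the three boolean features (`paid_well`, `interesting`, `on_site`) and the helper names are all invented for illustration. This is a bare-bones version of the ID3 recursion, without pruning or handling of numeric features.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def split(rows, labels, feature):
    """Group rows and labels by the value each row has for `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        group_rows, group_labels = groups.setdefault(row[feature], ([], []))
        group_rows.append(row)
        group_labels.append(label)
    return groups

def gain(rows, labels, feature):
    """Information gain of splitting on `feature`."""
    parts = split(rows, labels, feature).values()
    return entropy(labels) - sum(len(l) / len(labels) * entropy(l) for _, l in parts)

def id3(rows, labels, features):
    """Return a class label (leaf) or a nested dict {feature: {value: subtree}}."""
    if len(set(labels)) == 1:            # all examples agree: leaf
        return labels[0]
    if not features:                     # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, labels, f))
    rest = [f for f in features if f != best]
    return {best: {value: id3(r, l, rest)
                   for value, (r, l) in split(rows, labels, best).items()}}

def classify(tree, row):
    """Walk the tree; raises KeyError for a feature value never seen in training."""
    while isinstance(tree, dict):
        feature, branches = next(iter(tree.items()))
        tree = branches[row[feature]]
    return tree

# Hypothetical past projects: (features, would I do it again?)
projects = [
    ({"paid_well": True,  "interesting": True,  "on_site": False}, "accept"),
    ({"paid_well": True,  "interesting": False, "on_site": True},  "accept"),
    ({"paid_well": False, "interesting": True,  "on_site": False}, "accept"),
    ({"paid_well": False, "interesting": False, "on_site": True},  "reject"),
    ({"paid_well": False, "interesting": False, "on_site": False}, "reject"),
]
rows = [features for features, _ in projects]
labels = [label for _, label in projects]

tree = id3(rows, labels, ["paid_well", "interesting", "on_site"])
print(tree)
print(classify(tree, {"paid_well": False, "interesting": True, "on_site": True}))
```

The tree is kept as nested dictionaries so that it can be printed and inspected directly. In this made-up data `on_site` carries almost no information, so it never ends up in the tree.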

Upvotes: 2
