Jamona Mican

Reputation: 1654

Algorithm to categorize arbitrary number of HTML documents into topics

I've found algorithms that compare two documents and generate a 'closeness' score. Is there a known algorithm that could read a moderate number of HTML documents (double to triple digits) and group them by topic? Ideally without running a two-input algorithm on every possible pair of source documents.

I guess Google News must be using something like this.

Just to clarify, here is an example:

Input: 100 HTML documents
Output:
- 3 categories found:
  * CategoryA: 30 documents
  * CategoryB: 20 documents
  * CategoryC: 5 documents
  * Uncategorised: 45 documents

Upvotes: 2

Views: 117

Answers (1)

Simeon Visser

Reputation: 122446

You should look into algorithms in the area of cluster analysis. What you describe is a very general form of unsupervised learning, but you can improve the quality of the results if you give the algorithm some additional input before it searches for categories.

You will need to come up with a way of comparing the documents or at least enumerating the relevant characteristics (length, frequency of words, et cetera). These can serve as input to the clustering algorithm that you're using. For example, you could define the following characteristics:

  • number of words
  • number of images
  • number of external links
  • number of words related to geography
  • number of words related to biology
  • number of words related to economy
  • et cetera

The more specific you are about what categories you want, the better the algorithms perform. The above characteristics will give you a vector of numbers for each document:

(384 , 12,  8, ...,  0)
(1244, 39, 10, ..., 55)
(128 ,  2, 66, ..., 33)
...
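To make this concrete, here is a minimal sketch of how such a vector could be computed with Python's standard library only. The keyword lists in `TOPIC_WORDS`, the names `FeatureParser` and `extract_features`, and the rule that a link is "external" when its `href` is an absolute http(s) URL are all assumptions made for illustration; a real system would need much larger vocabularies.

```python
from html.parser import HTMLParser
import re

# Hypothetical keyword lists, chosen only for this illustration.
TOPIC_WORDS = {
    "geography": {"river", "mountain", "continent", "border"},
    "biology": {"cell", "species", "protein", "gene"},
    "economy": {"market", "inflation", "trade", "bank"},
}


class FeatureParser(HTMLParser):
    """Collects visible text, image count, and external link count."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.images = 0
        self.external_links = 0
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            self.images += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: absolute http(s) URLs count as external links.
            if href.startswith(("http://", "https://")):
                self.external_links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def extract_features(html):
    """Return [words, images, external links, geography, biology, economy]."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    vector = [len(words), parser.images, parser.external_links]
    for topic in ("geography", "biology", "economy"):
        vector.append(sum(1 for w in words if w in TOPIC_WORDS[topic]))
    return vector
```

Calling `extract_features` on each document gives one row of the kind shown above.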

A clustering algorithm (such as k-means clustering) can now assign each document to the most likely cluster. Note that this is just an example. For your particular problem it may be useful to define more specific characteristics for a more specific domain (such as medical articles).
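As a rough illustration of the clustering step, the following is a small pure-Python sketch of k-means (Lloyd's algorithm). The function name `kmeans` and its parameters are my own choices, not taken from a library. It assumes you pick the number of clusters `k` up front, and it needs Python 3.8+ for `math.dist`.

```python
import math
import random


def kmeans(vectors, k, iterations=100, seed=0):
    """Plain k-means (Lloyd's algorithm). Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen input vectors as initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [None] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Usage: two obvious groups of 2-dimensional feature vectors.
docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(docs, k=2)
```

Because k-means uses Euclidean distance, a feature with a large range (such as total word count) will dominate the others, so it is usually worth scaling each feature to a comparable range before clustering. Also note that plain k-means puts every document into some cluster, so an "Uncategorised" bucket would need an extra rule, for example a maximum distance to the nearest centroid.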

Upvotes: 1
