Algorithm to categorize arbitrary number of HTML documents into topics

Question

I've found algorithms that compare two documents and generate a 'closeness' score. Is there a known algorithm that can read a moderate number of HTML documents (double to triple digits) and group them? Ideally without running a two-input algorithm on every possible pair of source documents.

I guess Google News must be using something like this.

Just to clarify, here is an example:

Input: 100 HTML documents
Output:
- 3 categories found:
* CategoryA:  30 documents
* CategoryB:  20 documents
* CategoryC:  5  documents
* Uncategorised: 45 documents

Simeon Visser · Accepted Answer

You should look into algorithms from the field of cluster analysis. What you describe is a very general form of unsupervised learning, but you can improve the quality of the results by giving the algorithm some additional input before it searches for categories.

You will need to come up with a way of comparing the documents or at least enumerating the relevant characteristics (length, frequency of words, et cetera). These can serve as input to the clustering algorithm that you're using. For example, you could define the following characteristics:

- number of words
- number of images
- number of external links
- number of words related to geography
- number of words related to biology
- number of words related to economy
- et cetera

The more specific you are about which categories you want, the better the algorithms perform. The above characteristics will give you a vector of numbers for each document:

(384 , 12,  8, ...,  0)
(1244, 39, 10, ..., 55)
(128 ,  2, 66, ..., 33)
...
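
As a rough sketch of how such a vector could be computed from raw HTML, the following uses only Python's standard library. The names `FeatureExtractor` and `feature_vector`, and the tiny word lists, are made up for illustration; a real system would need much larger vocabularies and probably a more robust HTML parser.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical topic vocabularies, far too small for real use.
GEOGRAPHY_WORDS = {"river", "mountain", "continent", "ocean"}
BIOLOGY_WORDS = {"cell", "species", "protein", "gene"}


class FeatureExtractor(HTMLParser):
    """Collects visible words, image count and external link count."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.external_links = 0
        self.words = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            self.images += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: a link counts as external if it names a host.
            if urlparse(href).netloc:
                self.external_links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore text that belongs to scripts and stylesheets.
        if not self._skip:
            self.words.extend(re.findall(r"[a-z]+", data.lower()))


def feature_vector(html):
    """Return (words, images, external links, geography words, biology words)."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    words = parser.words
    return (
        len(words),
        parser.images,
        parser.external_links,
        sum(w in GEOGRAPHY_WORDS for w in words),
        sum(w in BIOLOGY_WORDS for w in words),
    )


sample = (
    '<p>The river meets the ocean.</p><img src="a.png">'
    '<a href="http://example.com/x">link</a><a href="/local">home</a>'
)
print(feature_vector(sample))  # (7, 1, 1, 2, 0)
```

Calling `feature_vector` on every document yields one tuple per document, which is the kind of input a clustering algorithm expects.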

A clustering algorithm (such as k-means clustering) can now help you in assigning each document to the most likely cluster. Note that this is just an example. For your particular problem it may be useful to define more specific characteristics for a more specific domain (such as medical articles).
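
To make the k-means step concrete, here is a minimal pure-Python sketch of the standard iterative procedure (Lloyd's algorithm). The function name `kmeans`, its parameters and the toy vectors are assumptions for illustration, not part of any library; in practice you would more likely use an existing implementation.

```python
import math
import random


def kmeans(vectors, k, iterations=100, seed=0):
    """Cluster numeric vectors into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-feature vectors forming two obvious groups.
docs = [(1, 1), (1, 2), (2, 1), (50, 50), (51, 50), (50, 51)]
labels, centroids = kmeans(docs, k=2)
print(labels)
```

Two caveats apply. You have to choose `k` up front, and every document is assigned to some cluster, so an "Uncategorised" bucket like the one in the question does not fall out of plain k-means by itself. Features with large ranges (such as the word count) also dominate the Euclidean distance, so scaling each feature to a comparable range first is usually worthwhile.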
