Reputation: 1654
I've found algorithms that compare two documents and produce a 'closeness' score. Is there a known algorithm that could read a moderate number of HTML documents (double to triple digits) and group them? Ideally without running a two-input comparison on every possible pair of source documents.
I guess Google News must be using something like this.
Just to clarify, here is an example:
Input: 100 HTML documents
Output:
- 3 categories found:
* CategoryA: 30 documents
* CategoryB: 20 documents
* CategoryC: 5 documents
* Uncategorised: 45 documents
Upvotes: 2
Views: 117
Reputation: 122446
You should look into algorithms from the area of cluster analysis. You seem to be looking for a very broad method of unsupervised learning, but you can improve the quality of the results if you provide some additional input to the algorithm before it searches for categories.
You will need to come up with a way of comparing the documents, or at least of enumerating their relevant characteristics (total length, frequency of particular words, et cetera). These characteristics can serve as input to whatever clustering algorithm you use. The more specific you are about what categories you want, the better the algorithms will perform. Characteristics like these give you a vector of numbers for each document:
(384,  12,  8, ..., 0)
(1244, 39, 10, ..., 55)
(128,   2, 66, ..., 33)
...
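
For concreteness, here is a minimal sketch of how such vectors could be extracted in Python. It assumes BeautifulSoup is available for HTML parsing; the keyword list and the docs/*.html path are illustrative placeholders, not part of the answer itself:

import glob
from bs4 import BeautifulSoup  # pip install beautifulsoup4

KEYWORDS = ["price", "review", "news"]  # hypothetical domain keywords

def features(html: str) -> list[int]:
    # Strip the markup, then count total words and each keyword.
    words = BeautifulSoup(html, "html.parser").get_text().lower().split()
    return [len(words)] + [words.count(k) for k in KEYWORDS]

# One numeric vector per document, e.g. (384, 12, 8, ...).
vectors = [features(open(p, encoding="utf-8").read())
           for p in glob.glob("docs/*.html")]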
A clustering algorithm (such as k-means clustering) can then assign each document to the most likely cluster. Note that this is just an example; for your particular problem it may be useful to define more specific characteristics for a more specific domain (such as medical articles).
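
And a sketch of the clustering step itself, using scikit-learn's k-means as one possible implementation; the choice of three clusters is an assumption borrowed from the question's example:

import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

X = np.array(vectors)  # the vectors built in the previous sketch
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for cluster_id in range(3):
    print(f"Category {cluster_id}: {np.sum(labels == cluster_id)} documents")

Note that plain k-means assigns every document to some cluster; to reproduce the 'Uncategorised' bucket from the question you would need an extra rule of your own, for example dropping documents whose distance to their cluster centre exceeds a threshold. Scaling the features first (e.g. with sklearn.preprocessing.StandardScaler) usually helps, since raw word counts can dwarf keyword counts.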
Upvotes: 1