Reputation: 8845
I have 10,000 HTML pages.
I know that some are built with the same CMS, and hence have "kind of" the same structure, though not exactly alike. I expect there to be around 100 different CMSs, but I don't know them beforehand, so I can't look for predefined patterns.
This is why I need an algorithm to calculate a similarity measure between pages and then cluster them based on that similarity.
I would be happy to find some tools in Ruby, but other languages are also welcome.
PS. I do not want to look at similarity in content (text). I only want to compare at the meta level: HTML structure, CSS rules, class names, and so on.
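For illustration, here is a rough sketch of the kind of thing I have in mind, using Nokogiri to reduce a page to a structural fingerprint and Jaccard overlap to compare two pages. The choice of features (tag names, tag.class pairs, stylesheet links) is just an example, not something I'm committed to:

```ruby
require 'nokogiri'
require 'set'

# Structural fingerprint: the set of tag names, tag.class combinations and
# linked stylesheet paths found in the page -- the text content is ignored.
def fingerprint(html)
  doc = Nokogiri::HTML(html)
  tokens = Set.new
  doc.traverse do |node|
    next unless node.element?
    tokens << node.name
    (node['class'] || '').split.each { |cls| tokens << "#{node.name}.#{cls}" }
  end
  doc.css('link[rel=stylesheet]').each { |link| tokens << "css:#{link['href']}" }
  tokens
end

# Jaccard similarity between two fingerprints (0.0 .. 1.0).
def similarity(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end
```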
Upvotes: 3
Views: 1217
Reputation: 160581
In a past life I wrote a lot of analytics software for a company that had to dig through a huge number of pages, easily the number you're talking about, to return similar types of information.
No matter how you want to determine similarity, you have to write the rules yourself. Pages vary too much, and code can't really understand what "similar" means, nor can it determine what is important to your particular use.
There are plenty of things you can try, but in the end you have to look through the pages yourself and decide which features matter; no other programmer can guess what those might be.
HTML structure, the order of the individual tags, isn't nearly as useful as it used to be, since CSS and JavaScript can move things all over the page once it's loaded into a browser, so what the eye sees can vary greatly from what standard code-based tools see. Two versions of the same CMS can have radically different output but, as a result of the CSS/JavaScript, appear the same to viewers, so again, you have to determine how to correlate them.
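Just to illustrate the shape of the work, not as a recipe: once you've decided which features matter, you can reduce each page to a feature set (like the fingerprint sketched in the question) and group pages greedily whenever they exceed a similarity threshold. The 0.8 cutoff and the features themselves are placeholders you would have to tune against your own pages:

```ruby
# Greedy clustering sketch: assign each page to the first cluster whose
# representative page is at least `threshold` similar, else start a new cluster.
# `fingerprints` is a hash of path => Set of structural tokens;
# the 0.8 threshold is an arbitrary placeholder to tune against real pages.
def cluster(fingerprints, threshold = 0.8)
  clusters = [] # each cluster: { rep: Set, pages: [paths] }
  fingerprints.each do |path, fp|
    match = clusters.find { |c| similarity(c[:rep], fp) >= threshold }
    if match
      match[:pages] << path
    else
      clusters << { rep: fp, pages: [path] }
    end
  end
  clusters
end

# Usage sketch:
# fingerprints = Dir.glob('pages/**/*.html').to_h { |p| [p, fingerprint(File.read(p))] }
# cluster(fingerprints).each { |c| puts c[:pages].inspect }
```

Whatever you pick, expect to iterate: inspect the clusters you get, adjust the features and the threshold, and repeat until the grouping matches what your eyes tell you.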
Upvotes: 2