Niels Kristian

Reputation: 8845

Algorithm to detect if two HTML pages are similar?

I have 10,000 HTML pages.

I know that some are built with the same CMS and hence have "kind of" the same structure, though not exactly alike. I expect there to be around 100 different CMSs, but I don't know them beforehand, so I can't look for predefined patterns.

This is why I need an algorithm that calculates a similarity measure between pages and then clusters them based on that similarity. Does something like that exist?

I would be happy to find some tools in Ruby, but other languages are also welcome.

PS: I don't want to look at similarity in content (text). I only want to compare at the meta level: HTML structure, CSS rules, class names, etc.

Upvotes: 3

Views: 1217

Answers (1)

the Tin Man

Reputation: 160581

In a past life I wrote a lot of analytics software for a company that had to dig through a huge number of pages, easily the number you're talking about, to return similar types of information.

No matter how you want to determine similarity, you have to write the rules yourself. Pages vary too much, and code can't really understand what "similar" means, nor can it determine what is important to your particular use.

Things you can do:

  • Determine the total size of the "text" nodes (viewable and invisible text plus CSS and JavaScript). You could get the sizes of the last two and subtract them from the overall text size to get an idea of the total content, but that won't take into account the effect CSS or JavaScript has on the visible page.
  • Look in meta tags for useful information, like keywords or related pages.
  • Look for tables and get counts of their rows and cells and the size of their text, and possibly search for data to correlate or compare.
  • Look for links and anchors and get the similarity of their text and/or hrefs.
  • Look for images and anything with "alt" text and then compare those.
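As a rough illustration of the measurements listed above, here is a minimal sketch in Python using only the standard library's `html.parser`. The class name `PageFeatures`, the function `extract_features`, and the choice of which tags to count are my own assumptions for the example, not part of any existing tool; you would swap in whatever signals matter for your pages.

```python
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collects a few coarse, hand-picked measurements from one HTML page.

    Hypothetical helper for illustration; the chosen signals are assumptions.
    """

    def __init__(self):
        super().__init__()
        self.counts = {"table": 0, "tr": 0, "td": 0, "a": 0, "img": 0}
        self.meta = {}        # meta name (lowercased) -> content
        self.alt_texts = []   # alt attributes found on any tag
        self.hrefs = []       # href values from <a> tags
        self.text_size = 0    # characters of ordinary text
        self.code_size = 0    # characters inside <script> and <style>
        self._in_code = 0     # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.counts:
            self.counts[tag] += 1
        if tag in ("script", "style"):
            self._in_code += 1
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._in_code:
            self._in_code -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not inflate sizes.
        size = len(data.strip())
        if self._in_code:
            self.code_size += size
        else:
            self.text_size += size


def extract_features(html):
    """Parse one page and return its measurements as a plain dict."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return {
        "counts": parser.counts,
        "meta": parser.meta,
        "alt_texts": parser.alt_texts,
        "hrefs": parser.hrefs,
        "text_size": parser.text_size,
        "code_size": parser.code_size,
    }
```

Calling `extract_features(html_string)` on each page gives you one dict per page that you can then compare however you see fit. Note that this only sees inline CSS and JavaScript; external stylesheets and scripts would have to be fetched and measured separately.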

In the end, though, you have to look through the pages and determine what is important; no other programmer can guess what that might be.

HTML structure, the order of the individual tags, isn't nearly as useful as it used to be, since CSS and JavaScript can move things all over the page once it's loaded into a browser, so what the eye sees can vary greatly from what standard code-based tools see. Two versions of the same CMS can have radically different output but, as a result of the CSS/JavaScript, appear the same to viewers, so again, you have to determine how to correlate them.
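Since tag order is unreliable, one order-independent signal you could try is the set of tag and class-name combinations a page uses, which is closer to what the question asks about. The following is only a sketch under my own assumptions: the names `structure_tokens`, `jaccard`, and `group_pages` are made up for the example, Jaccard similarity is just one possible measure, and the `0.6` threshold is an arbitrary starting point you would have to tune by inspecting your own pages.

```python
from html.parser import HTMLParser


class StructureTokens(HTMLParser):
    """Collects the set of tags and tag.class pairs used by a page."""

    def __init__(self):
        super().__init__()
        self.tokens = set()

    def handle_starttag(self, tag, attrs):
        self.tokens.add(tag)
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    self.tokens.add(f"{tag}.{cls}")


def structure_tokens(html):
    parser = StructureTokens()
    parser.feed(html)
    parser.close()
    return parser.tokens


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_pages(pages, threshold=0.6):
    """Greedy single-pass grouping (an assumption, not a real clustering library).

    Each page joins the first group whose first member is similar enough;
    otherwise it starts a new group. `pages` maps a page name to its HTML.
    """
    groups = []  # list of (representative token set, [page names])
    for name, html in pages.items():
        tokens = structure_tokens(html)
        for representative, members in groups:
            if jaccard(tokens, representative) >= threshold:
                members.append(name)
                break
        else:
            groups.append((tokens, [name]))
    return [members for _, members in groups]
```

The result depends on the order in which pages are visited and on the threshold, so treat the groups as candidates to review by hand rather than as a final answer.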

Upvotes: 2
