Pico

Reputation: 603

How to categorize urls using machine learning?

I'm indexing websites' content and I want to implement some categorization based solely on the URLs.

I would like to tell apart content view pages from navigation pages. By 'content view pages' I mean webpages where one can typically see the details of a product or a written article. By 'navigation pages' I mean pages that (typically) consist of lists of links, either to content pages or to other, more specific list pages.

Although some sites use a site-wide key system to map their content, most sites do it bit by bit and scope their key mapping, so this should be possible.

In practice, what I want to do is take the list of URLs from a site and group them by similarity. I believe this can be done with machine learning, but I have no idea how. Machine learning appears to be a broad topic; what should I start reading about in particular? Which concepts, which algorithms, which tools?

Upvotes: 5

Views: 3036

Answers (3)

Ben Allison

Reputation: 7394

If you want to discover these groups automatically, I suggest you find an implementation of a clustering algorithm (k-means is probably the most popular; you don't say what language you want to do this in). You know there are two categories, so an algorithm that lets you specify the number of clusters a priori will make the problem easier.

After that, define a bunch of features for your webpages and run them through k-means to see what kinds of groups are produced. Tweak the features until you get something that looks satisfactory. If you have access to the webpages themselves, I'd strongly recommend using features defined over the whole page, rather than just the URLs.
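A minimal sketch of this approach in Python with scikit-learn. The feature choice here (character n-grams over the URLs) is an illustrative assumption, not a prescription, and the example URLs are hypothetical; swap in whatever features work for your sites:

    # Cluster URLs into two groups with k-means.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    urls = [
        "http://example.com/products/",             # hypothetical navigation page
        "http://example.com/products/page/2/",      # hypothetical navigation page
        "http://example.com/products/red-widget",   # hypothetical content page
        "http://example.com/products/blue-widget",  # hypothetical content page
    ]

    # Turn each URL into a bag of character n-grams so that structurally
    # similar paths (e.g. ".../page/2/") end up close together.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    X = vectorizer.fit_transform(urls)

    # The two categories (navigation vs. content view) are known a priori,
    # so the number of clusters is fixed at 2.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    for url, label in zip(urls, labels):
        print(label, url)

Which cluster ends up meaning "navigation" and which "content" is arbitrary; you'd inspect a few members of each to find out.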

Upvotes: 3

greeness

Reputation: 16104

I feel like you are trying to classify pages as Authorities and Hubs, as in the HITS algorithm.

  • Hub is your navigation page;
  • Authority is your content view page.

By doing a link analysis of every web page, you should be able to find out the type of each page by running HITS on all the webpages in a domain. As shown below, the left graph depicts the link relations between webpages, and the right graph shows the scores with respect to hub/authority after running HITS. HITS does not need any labels to start. The updating rule is simple: basically just one update for the authority scores and another update for the hub scores.

[Figures: link relations between webpages (left) and the resulting hub/authority scores after running HITS (right)]

Here is a tutorial discussing PageRank/HITS, from which I borrowed the two graphs above.

Here is an extended version of HITS (BHITS) that combines HITS with information retrieval methods (TF-IDF, the vector space model, etc.). It looks much more promising, but it certainly needs more work. I suggest you start with naive HITS and see how well it does; on top of that, try some of the techniques from BHITS to improve your performance.
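A minimal sketch of the naive HITS update rule mentioned above, in plain Python. It assumes the crawl is represented as a dict mapping each page URL to the URLs it links to (the toy `links` data is hypothetical); pages with high hub scores lean "navigation", high authority scores lean "content view":

    def hits(links, iterations=50):
        """Run the two-step HITS update: authorities, then hubs, then normalize."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Authority update: sum of hub scores of pages linking in.
            auth = {p: sum(hub[q] for q in links if p in links.get(q, ()))
                    for p in pages}
            # Hub update: sum of authority scores of pages linked to.
            hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
            # Normalize so the scores don't blow up.
            a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return hub, auth

    # Hypothetical toy domain: an index page linking to two articles.
    links = {"/index": ["/article-1", "/article-2"], "/article-1": ["/index"]}
    hub, auth = hits(links)
    print(max(hub, key=hub.get), "looks like a navigation page")
    print(max(auth, key=auth.get), "looks like a content page")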

Upvotes: 2

Steve

Reputation: 21499

You first need to collect a dataset of navigation and content pages and label them. After that it's quite straightforward.

What language will you be using? I'd suggest you try Weka, a Java-based tool in which you can simply press a button and get back performance measures for 50-odd algorithms. Once you know which is the most accurate, you can deploy that.
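If you'd rather stay out of Java, the same supervised workflow can be sketched in Python with scikit-learn instead of Weka. This is only an assumption about tooling, and the hand-labelled URLs below are hypothetical:

    # Train a classifier on labelled URLs and cross-validate it, analogous
    # to the per-algorithm performance measures Weka reports.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    urls = [
        "http://example.com/category/shoes/",
        "http://example.com/category/shoes/page/2/",
        "http://example.com/shoes/red-sneaker-42",
        "http://example.com/shoes/blue-boot-7",
    ]
    labels = ["navigation", "navigation", "content", "content"]

    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
        LogisticRegression(max_iter=1000),
    )

    # Cross-validated accuracy on the labelled set (tiny here, so take
    # the number with a grain of salt).
    print(cross_val_score(model, urls, labels, cv=2).mean())

    # Fit on everything and classify an unseen URL.
    model.fit(urls, labels)
    print(model.predict(["http://example.com/category/hats/"]))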

Upvotes: 2
