Poulos Spyros

Reputation: 39

How to conduct a web crawl for specific topic via Apache Nutch?

I'm new to this field, and as students we have to create a web portal for a specific topic. As a first step we have to crawl the web (or part of it) to gather links on this topic, before we index and rank them, with the final goal of feeding them into a database for our portal.

The thing is that I cannot come up with the right methodology. Let's say the theme of our portal is "health insurance".

  1. What are the steps I have to follow as a methodology, and which tools do I need?
  2. Is there a way to guide Nutch towards specific content?
  3. Should I fill my seeds.txt with a wide range of links, parse a lot of pages, and then filter the content?

You can describe the steps at a high level and I'll research how to implement them.

Upvotes: 2

Views: 501

Answers (3)

Poulos Spyros

Reputation: 39

Nutch comes with a built-in NaiveBayesParseFilter. You have to add the following property in nutch-site.xml and also create a training file as described below. In my experience it performs well even with a handful of training documents; of course, the more the merrier.

<property>
  <name>plugin.includes</name>
  <!-- add parsefilter-naivebayes to the plugin list already configured in
       plugin.includes; listing it alone would disable all other plugins -->
  <value>parsefilter-naivebayes</value>
</property>
<property>
  <name>parsefilter.naivebayes.trainfile</name>
  <value></value>
  <description>Set the name of the file to be used for Naive Bayes training.
The format is: each line contains two tab-separated columns:
1. "1" or "0" ("1" for a relevant, "0" for an irrelevant document).
2. Text (the text that will be used for training).

Each row is treated as a new "document" by the classifier.
CAUTION: Set parser.timeout to -1, or to a value well above 30, when using this classifier.
  </description>
</property>
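
For example, a tiny training file for the "health insurance" topic could look like the lines below. The examples are made up purely for illustration; the two columns must be separated by a real tab character:

1	Compare health insurance plans, premiums and deductibles for families.
1	Private health insurance coverage options and open enrollment periods.
0	Top ten pasta recipes for a quick weeknight dinner.
0	Latest football transfer news and match results.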

Upvotes: 1

Jorge Luis

Reputation: 3253

By default Nutch only cares about which links to crawl next (either in the current or the next crawl cycle). The choice of the "next URL" is controlled within Nutch by a scoring plugin.

Since NUTCH-2039 was merged, Nutch supports "relevance-based scoring". This means that you can define a gold standard (your ideal page) and let the crawler score each potential URL based on how similar the new link is to your ideal case. This provides (to some extent) a topic-based crawler.

You can take a look at https://cwiki.apache.org/confluence/display/nutch/SimilarityScoringFilter to see how to enable/configure this plugin.
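
As a rough sketch, enabling it in nutch-site.xml could look like the snippet below. The property names are taken from that wiki page and may differ between Nutch versions, so treat the exact names and the file path as assumptions to verify against your nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <!-- swap the default scoring-opic for scoring-similarity in the
       plugin list you already have configured -->
  <value>scoring-similarity</value>
</property>
<property>
  <name>scoring.similarity.model</name>
  <value>cosine</value>
</property>
<property>
  <!-- plain-text description of your ideal page, e.g. about health insurance -->
  <name>cosine.goldstandard.file</name>
  <value>goldstandard.txt</value>
</property>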

Upvotes: 1

rzo1

Reputation: 5751

Introduction

What you are trying to build is a so-called focused crawler or topical crawler, which only collects data that belongs to your specific domain of interest.

There are a lot of different (scientific) approaches to developing such a system. They often involve statistical methods or machine learning to estimate the similarity of a given Web page to your topic. Beyond that, the selection of seed points is crucial for this approach. I would recommend using a search engine to collect high-quality seeds for your domain of interest. As an alternative, you could use pre-classified URLs from Web directories such as curlie.org.

A good literature review on this topic, with in-depth explanations of the different approaches, is the journal paper by Kumar et al.

Process in Short

In short, the process of implementing such a system would be:

  1. Build a relevance model which can decide whether a given Web page belongs to your domain of interest / topic (e.g. a text classifier).
  2. Evaluate your domain-specific relevance model. If you are not satisfied, go back to (1).
  3. Feed your high-quality seed points into the system and start the crawl, as sketched below.
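
To make the loop behind these steps concrete, here is a minimal, hypothetical Java sketch. All class and method names are made up for illustration; this is not Nutch code:

import java.util.*;

// Hypothetical sketch of a focused-crawl loop: a frontier ordered by
// relevance, a relevance model, and a visited set.
public class FocusedCrawler {

    // Step 1: the relevance model, e.g. a trained text classifier.
    interface RelevanceModel {
        double score(String pageText); // 0.0 = irrelevant, 1.0 = on-topic
    }

    record Candidate(String url, double priority) {}

    public static void crawl(List<String> seeds, RelevanceModel model) {
        // Frontier: the best-scored URLs are fetched first.
        PriorityQueue<Candidate> frontier = new PriorityQueue<>(
                Comparator.comparingDouble(Candidate::priority).reversed());
        Set<String> visited = new HashSet<>();

        // Step 3: inject the high-quality seeds with maximal priority.
        seeds.forEach(s -> frontier.add(new Candidate(s, 1.0)));

        while (!frontier.isEmpty()) {
            Candidate next = frontier.poll();
            if (!visited.add(next.url())) continue;   // skip already-seen URLs

            String text = fetch(next.url());          // download and parse
            double relevance = model.score(text);     // classify the page
            if (relevance < 0.5) continue;            // prune off-topic pages

            store(next.url(), text);                  // keep it for the portal
            for (String link : extractLinks(text)) {
                // outlinks inherit the parent's relevance as their priority
                frontier.add(new Candidate(link, relevance));
            }
        }
    }

    // Placeholders for the actual fetching/parsing/storage machinery.
    static String fetch(String url) { return ""; }
    static List<String> extractLinks(String text) { return List.of(); }
    static void store(String url, String text) {}
}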

Architecture

A more or less general (focused) crawler architecture (on a single server/PC) looks like this:

[Figure: basic crawler architecture]

Disclaimer: Image is my own work. Please respect this by referencing this post.

Apache Nutch

Sadly, Apache Nutch cannot do this by default. You would have to implement the additional logic as a plugin. An inspiration for how to do this might be anthelion, which was a focused-crawler plugin for Nutch. However, it is no longer actively maintained.

Upvotes: 3
