Reputation: 39
I'm new to this field. As students, we have to create a web portal for a specific topic. As a first step we have to crawl the web (or part of it) to gather links on this topic, which we will then index and rank to serve as the database for our portal.
The thing is that I can't come up with the right methodology. Let's say the theme of our portal is "health insurance". Should I put a set of topic-relevant links into seeds.txt and crawl those in depth, or should I start with a wide range of links, parse a lot of links, and then filter the content? You can describe the steps at a high level and I'll research how to implement them.
Upvotes: 2
Views: 501
Reputation: 39
Nutch comes with a built-in NaiveBayesParseFilter. You have to add the following properties to nutch-site.xml (append parsefilter-naivebayes to your existing plugin.includes list rather than replacing it) and create a training file as described below. In my experience it performs well even with a handful of training documents, but of course the more the merrier.
<property>
<name>plugin.includes</name>
<value>parsefilter-naivebayes</value>
</property>
<property>
<name>parsefilter.naivebayes.trainfile</name>
<value></value>
<description>Set the name of the file to be used for Naive Bayes training. The format is:
Each line contains two tab-separated parts:
1. "1" or "0" ("1" for a relevant, "0" for an irrelevant document).
2. Text (the text that will be used for training).
Each row is considered a new "document" for the classifier.
CAUTION: Set parser.timeout to -1 or a value bigger than 30 when using this classifier.
</description>
</property>
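For reference, a minimal training file has one labelled example per line, with the label and the text separated by a tab (the example rows below are made up for illustration):
1	health insurance premiums deductibles and coverage plans
1	how to compare private health insurance policies for families
0	football league results and transfer rumours
0	cheap flights and hotel booking deals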
Upvotes: 1
Reputation: 3253
By default Nutch only cares about which links to crawl next (either in the current or the next crawl cycle). The concept of the "next URL" is controlled within Nutch by a scoring plugin.
Since NUTCH-2039 was merged, Nutch supports "relevance-based scoring". This means that you can define a gold standard (your ideal page) and let the crawler score each potential URL based on how similar the new link is to your ideal case. This provides, to some extent, a topic-based crawler.
You can take a look at https://cwiki.apache.org/confluence/display/nutch/SimilarityScoringFilter to see how to enable and configure this plugin.
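As a rough sketch, enabling it in nutch-site.xml could look like this (the property names below are the ones documented on that wiki page; verify them against your Nutch version):
<property>
<name>plugin.includes</name>
<value>scoring-similarity</value> <!-- append to your existing plugin.includes list -->
</property>
<property>
<name>scoring.similarity.model</name>
<value>cosine</value>
</property>
<property>
<name>cosine.goldstandard.file</name>
<value>goldstandard.txt</value> <!-- plain-text description of your ideal page -->
</property>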
Upvotes: 1
Reputation: 5751
What you are trying to build is a so-called focused crawler or topical crawler, which only collects data that falls into your specific domain of interest.
There are many different (scientific) approaches to developing such a system. They often involve statistical methods or machine learning to estimate the similarity of a given Web page to your topic. Furthermore, the selection of seed points is crucial for this approach. I would recommend using a search engine to collect high-quality seeds for your domain of interest. Alternatively, you could use pre-classified URLs from Web directories such as curlie.org.
A good literature review on this topic, with in-depth explanations of the different approaches, is the journal paper by Kumar et al.
In short, the process of implementing such a system would be:
1. Collect high-quality seed URLs for your topic (e.g. via a search engine or a Web directory, as described above).
2. Fetch the pages behind the URLs in your frontier.
3. Estimate the relevance of each fetched page to your topic (e.g. with a statistical or machine-learning classifier).
4. Extract the outlinks of relevant pages and add the unseen ones to the frontier; discard the links of irrelevant pages.
5. Repeat from step 2 until you have gathered enough data.
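As a rough illustration of this loop (a hypothetical sketch, not a production crawler: is_relevant() below is a toy stand-in for whatever classifier you actually train):

# Hypothetical sketch of a focused-crawler loop.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

TOPIC_TERMS = {"health", "insurance", "premium", "coverage", "deductible"}

class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def is_relevant(text):
    # Toy relevance test based on term overlap; replace with a trained
    # classifier (e.g. Naive Bayes) in a real system.
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) >= 2

def focused_crawl(seeds, max_pages=100):
    frontier = deque(seeds)          # URLs still to visit
    seen = set(seeds)
    collected = []                   # URLs judged relevant to the topic
    while frontier and len(collected) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                 # skip unreachable pages
        if not is_relevant(html):
            continue                 # irrelevant page: do not follow its links
        collected.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return collected

In a real system the frontier would be a priority queue ordered by a relevance score rather than a FIFO queue; that prioritization is essentially what Nutch's scoring plugins provide.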
A more or less general (focused) crawler architecture (on a single server/PC) looks like this:
[Figure: focused crawler architecture diagram; not reproduced here]
Disclaimer: The image is my own work. Please respect this by referencing this post.
Sadly, Apache Nutch cannot do this by default; you have to implement the additional logic as a plugin. An inspiration for how to do this might be anthelion, which was a focused-crawler plugin for Nutch. However, it is not actively maintained anymore.
Upvotes: 3