muntakim360

Reputation: 97

Re-crawl Nutch periodically using cron job

I've successfully crawled a website using Nutch 1.12 and indexed it in Solr 6.1 using the following command:

[root@2a563cff0511 nutch-latest]# bin/crawl -i \
> -D solr.server.url=http://192.168.99.100:8983/solr/test/ urls/ crawl 5

When I run the same command again, it reports the following:

[root@2a563cff0511 nutch-latest]# bin/crawl -i \
> -D solr.server.url=http://192.168.99.100:8983/solr/test/ urls/ crawl 5
Injecting seed URLs
/opt/nutch-latest/bin/nutch inject crawl/crawldb urls/
Injector: starting at 2016-06-19 15:29:08
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2016-06-19 15:29:13, elapsed: 00:00:05
Sun Jun 19 15:29:13 UTC 2016 : Iteration 1 of 1
Generating a new segment
/opt/nutch-latest/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2016-06-19 15:29:15
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

However, I have since made changes to the website, i.e. a new file has been added and an existing file has been modified, and I want to re-crawl it periodically using a cron job so that these changes are picked up in Solr.

Upvotes: 0

Views: 1076

Answers (1)

m5khan

Reputation: 2717

You used the bin/crawl command, which executes complete crawl cycles. When you executed the command the first time, it went up to depth 5, i.e. it executed 5 cycles.

Now, when you run the same command again with the same crawl folder (crawl in your case), it tries to fetch pages at depth 6, because the CrawlDB already has the pages retrieved in the earlier 5 cycles marked as fetched.
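
You can confirm this from the CrawlDB statistics. A minimal check, assuming the same crawl directory as in your command (the stats output includes counters such as db_fetched and db_unfetched):

bin/nutch readdb crawl/crawldb -stats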

There are a few possible reasons why no more pages are being fetched.

One reason could be that there are simply no more links left to fetch. If you have restricted the URLs to fetch in NUTCH_HOME/conf/regex-urlfilter.txt, that could also be the cause.
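
For illustration, a regex-urlfilter.txt restricted to a single site could look like the following (example.com is just a placeholder domain; the last rule rejects anything no earlier rule accepted):

# accept only pages under the site being crawled (placeholder domain)
+^https?://(www\.)?example\.com/
# reject everything else
-.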

There could be other constraints in your configuration too; check out my answer on How to increase number of documents fetched by Apache Nutch crawler
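
One such constraint, as a hedged example: by default the db.fetch.interval.default property makes Nutch wait 30 days (2592000 seconds) before an already-fetched page becomes due for fetching again, which is why an immediate re-run can select 0 records. If your cron job should pick up changed pages sooner, you could lower it in NUTCH_HOME/conf/nutch-site.xml; a sketch assuming a one-day interval suits your site:

<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds before a fetched URL is due again; 86400 = 1 day (default 2592000 = 30 days) -->
  <value>86400</value>
</property>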


As I understand from your question's title, "Re-crawl Nutch periodically using cron job": if you want to re-crawl the pages from scratch, then you should change or remove the folder where Nutch's crawldb, linkdb and segments are saved (the "crawl" folder in your case). This will not continue the crawl from the last run but will start from ground zero again.
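
If starting from scratch each time is acceptable, the periodic part can be handled by a small wrapper script plus a crontab entry. A minimal sketch, reusing the paths from your command (the script name recrawl.sh and the log file location are just placeholders):

#!/bin/bash
# recrawl.sh: drop the previous crawl data and crawl/index again from scratch
NUTCH_HOME=/opt/nutch-latest
cd "$NUTCH_HOME" || exit 1
rm -rf crawl    # removes the crawldb, linkdb and segments from the last run
bin/crawl -i -D solr.server.url=http://192.168.99.100:8983/solr/test/ urls/ crawl 5

And a crontab entry (crontab -e) to run it, here every night at 02:00:

0 2 * * * /opt/nutch-latest/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1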

You can also check out this post.

Upvotes: 1
