Reputation: 655
I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites and Nutch successfully crawled them and stored the data in MySQL. I am using Solr 4.0.0 for searching.
Now my problem is that when I try to re-crawl a site like trailer.apple.com or any other site, it always crawls the last crawled URLs. I have even removed the last crawled URLs from the seeds.txt file and entered the new URLs, but Nutch is not crawling the new URLs.
Can anybody tell me what I am actually doing wrong?
Also, please suggest a Nutch plugin that can help with crawling video and movie sites.
Any help would be really appreciated.
Upvotes: 2
Views: 3300
Reputation: 1693
Just add the property below to your nutch-site.xml; it works for me, so check it:
<property>
  <name>file.crawl.parent</name>
  <value>false</value>
</property>
Then change regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
# accept anything else
+.
After that, remove the index directory, either manually or with a command such as: rm -r $NUTCH_HOME/indexdir
Then run your crawl command again.
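As a rough sketch, assuming the indexdir path above and that your seed list lives in the urls directory (the depth value is only an example), the whole sequence could look like:
rm -r $NUTCH_HOME/indexdir
bin/nutch inject urls
bin/nutch crawl urls -depth 3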
Upvotes: 0
Reputation: 1102
I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.
The first time I started Nutch I did the following:
Add the domain "www.domain01.com" in /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt (without quotes)
In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, add a new line:
# accept anything else
^http://([a-z0-9]*.)*www.domain01.com/sport/
... and everything was fine.
Next I made the following changes:
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and add two new domains: www.domain02.com and www.domain03.com
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Remove www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and add two new lines:
# accept anything else
^http://([a-z0-9]*.)www.domain02.com/sport/
^http://([a-z0-9].)*www.domain03.com/sport/
Next I executed the following commands:
updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3
And Nutch still crawls www.domain01.com.
I don't know why.
I use Nutch 2.1 on Debian Linux 6.0.5 (x64), and Linux runs in a virtual machine on Windows 7 (x64).
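A possible explanation, purely as an assumption about the Nutch 2.x + MySQL (Gora) setup and not something stated in these posts: once injected, URLs stay in the MySQL-backed webpage table, so the old domain keeps being generated even after it is removed from seed.txt. If you want to start from a clean crawl store before re-injecting the new seeds, something like this might help (the webpage table name is the Gora SQL default and is an assumption here):
SELECT COUNT(*) FROM webpage;   -- see how many URLs are still stored
TRUNCATE TABLE webpage;         -- empty the crawl store before re-injecting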
Upvotes: 2
Reputation: 113
This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl using nutch-site.xml. Also, the seed.txt file is only meant to be a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject it again).
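For instance, a minimal sketch of such an override in nutch-site.xml, assuming you want pages re-fetched after 7 days instead of the 30-day default (the value is in seconds):
<property>
  <name>db.fetch.interval.default</name>
  <!-- 7 days in seconds; pick whatever re-crawl interval suits your sites -->
  <value>604800</value>
</property>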
Another configuration that may help is your regex-urlfilter.txt, if you want to point to a specific place or exclude certain domains/pages, etc.; a short sketch follows.
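As a sketch only (both domain names are placeholders, not taken from the posts above), a regex-urlfilter.txt that rejects one domain, accepts another, and drops everything else could look like:
# skip an unwanted domain (placeholder name)
-^http://(www\.)?unwanted-example\.com/
# accept only pages under the wanted domain (placeholder name)
+^http://(www\.)?wanted-example\.com/
# reject anything else
-.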
Cheers.
Upvotes: 1