Reputation: 11
I've successfully run Nutch (v1.4) for a crawl using local mode on my Ubuntu 11.10 system. However, when switching over to "deploy" mode (all else being the same), I get an error during the fetch cycle.
I have Hadoop running succesfully on the machine, in a pseudo-distributed mode (replication factor is 1 and I have just 1 map and 1 reduce job setup). "jps" shows that all Hadoop daemons are up and running. 18920 Jps 14799 DataNode 15127 JobTracker 14554 NameNode 15361 TaskTracker 15044 SecondaryNameNode
I have also added the HADOOP_HOME/bin path to my PATH variable.
PATH=$PATH:/home/jimb/hadoop/bin
Then I ran the crawl from the nutch/deploy directory, as below:
bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls
Here is the output I get:
12/01/25 13:55:49 INFO crawl.Crawl: crawl started in: /data/runs/ar/crawls
12/01/25 13:55:49 INFO crawl.Crawl: rootUrlDir = /data/runs/ar/seedurls
12/01/25 13:55:49 INFO crawl.Crawl: threads = 10
12/01/25 13:55:49 INFO crawl.Crawl: depth = 5
12/01/25 13:55:49 INFO crawl.Crawl: solrUrl=null
12/01/25 13:55:49 INFO crawl.Injector: Injector: starting at 2012-01-25 13:55:49
12/01/25 13:55:49 INFO crawl.Injector: Injector: crawlDb: /data/runs/ar/crawls/crawldb
12/01/25 13:55:49 INFO crawl.Injector: Injector: urlDir: /data/runs/ar/seedurls
12/01/25 13:55:49 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
12/01/25 13:56:53 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
12/01/25 13:57:21 INFO crawl.Injector: Injector: Merging injected urls into crawl db.
...
12/01/25 13:57:48 INFO crawl.Injector: Injector: finished at 2012-01-25 13:57:48, elapsed: 00:01:59
12/01/25 13:57:48 INFO crawl.Generator: Generator: starting at 2012-01-25 13:57:48
12/01/25 13:57:48 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
12/01/25 13:57:48 INFO crawl.Generator: Generator: filtering: true
12/01/25 13:57:48 INFO crawl.Generator: Generator: normalizing: true
12/01/25 13:57:48 INFO mapred.FileInputFormat: Total input paths to process : 2
...
12/01/25 13:58:15 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
12/01/25 13:58:16 INFO crawl.Generator: Generator: segment: /data/runs/ar/crawls/segments/20120125135816
...
12/01/25 13:58:42 INFO crawl.Generator: Generator: finished at 2012-01-25 13:58:42, elapsed: 00:00:54
12/01/25 13:58:42 ERROR fetcher.Fetcher: Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Now, the configuration files for the "local" mode are setup fine (since a crawl in local mode succeeded). For running in deploy mode, since the "deploy" folder did not have any "conf" subdirectory, I assumed that either: a) the conf files need to be copied over under "deploy/conf", OR b) the conf files need to be placed onto HDFS.
I have verified that option (a) above does not help. So, I'm assuming that the Nutch configuration files need to exist in HDFS, for the HDFS fetcher to run successfully? However, I don't know at what path within HDFS I should place these Nutch conf files, or perhaps I'm barking up the wrong tree?
If Nutch reads config files during "deploy" mode from the files under "local/conf", then why is it that the local crawl worked fine, but the deploy-mode crawl isn't?
What am I missing here?
Thanks in advance!
Upvotes: 1
Views: 1333
Reputation: 6169
Try this out:
In the nutch source directory, modify the file conf/nutch-site.xml
to set http.agent.name
properly.
re-build the code using ant
Go to runtime/deploy
directory, set the required environment variables and try crawling again.
Upvotes: 2
Reputation: 11
This is likely because you have not rebuilt yet. Can you run "ant" and see what happens? Obviously, you need to update the http.agent.name in nutch-site.xml if you have not done so yet.
Upvotes: 1