Reputation: 113
I have installed Apache Nutch 2.3.1 on top of a multi-node Hadoop 2.5.2 cluster (AWS EC2 machines). I have configured the Nutch files accordingly on the master node and copied the seed.txt file (which contains the URLs to be crawled) from the master to HDFS. Now I run the following command to crawl:
bin/hadoop jar /home/ubuntu/nutch/runtime/deploy/apache-nutch-2.3.1.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 1 -topN 5
I get the following error:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
I have Java 1.8.0_151 installed. It seems the Crawl class cannot be found with this Java version. Should I replace Java 1.8 with Java 1.7, or is the problem something else?
Please help me resolve this issue.
Upvotes: 0
Views: 287
Reputation: 2239
The class org.apache.nutch.crawl.Crawl was removed many years ago, so the ClassNotFoundException is not related to the Java version. The recommended approach is to run the shell script bin/crawl instead. It launches a Hadoop job for every step of a crawl: inject, generate, fetch, parse, etc. Alternatively, you can run each step individually via bin/nutch, cf. https://wiki.apache.org/nutch/Nutch2Tutorial
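As a rough sketch of what the bin/crawl invocation might look like for the setup in the question: the deploy path and the urls seed directory are taken from the question, while the crawl ID, the number of rounds, and the argument order (seed directory, crawl ID, number of rounds) are assumptions. Run bin/crawl without arguments to check the usage message of your version. The -topN value from the original command is not covered here. The helper crawl_command and the RUN_CRAWL switch are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Sketch only: paths and values are assumptions, adjust them to your cluster.
NUTCH_DEPLOY="${NUTCH_DEPLOY:-/home/ubuntu/nutch/runtime/deploy}"  # path from the question
SEED_DIR="urls"    # HDFS directory containing seed.txt (from the question)
CRAWL_ID="crawl"   # hypothetical crawl ID
ROUNDS=1           # number of crawl rounds, roughly what -depth 1 expressed

# Build the bin/crawl command line.
# Assumed argument order: <seedDir> <crawlID> <numberOfRounds>.
crawl_command() {
  printf '%s/bin/crawl %s %s %s\n' "$NUTCH_DEPLOY" "$1" "$2" "$3"
}

CMD="$(crawl_command "$SEED_DIR" "$CRAWL_ID" "$ROUNDS")"

# Print the command by default; set RUN_CRAWL=1 to actually launch the crawl.
if [ "${RUN_CRAWL:-0}" = "1" ]; then
  (cd "$NUTCH_DEPLOY" && $CMD)
else
  echo "$CMD"
fi
```

Printing the command first makes it easy to compare it against the usage message before any Hadoop jobs are submitted.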
Upvotes: 3