I have a cluster set up with the following software stack:
nutch-branch-2.3.1, gora-hbase 0.6.1, Hadoop 2.5.2, hbase-0.98.8-hadoop2
The crawl cycle is: inject, generate, fetch, parse, updatedb. The first two (inject and generate) work fine, but the fetch command, even though it executes successfully, fetches no data, and because the fetch step produces nothing, the subsequent steps fail as well.
Please find the counter logs for each job below:
Inject job:
2016-01-08 14:12:45,649 INFO [main] mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=114853
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=836443
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=179217
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=59739
Total vcore-seconds taken by all map tasks=59739
Total megabyte-seconds taken by all map tasks=183518208
Map-Reduce Framework
Map input records=29973
Map output records=29973
Input split bytes=94
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=318
CPU time spent (ms)=24980
Physical memory (bytes) snapshot=427704320
Virtual memory (bytes) snapshot=5077356544
Total committed heap usage (bytes)=328728576
injector
urls_injected=29973
File Input Format Counters
Bytes Read=836349
File Output Format Counters
Bytes Written=0
Generate job:
2016-01-08 14:14:38,257 INFO [main] mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=137140
FILE: Number of bytes written=623942
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=937
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=43788
Total time spent by all reduces in occupied slots (ms)=305690
Total time spent by all map tasks (ms)=14596
Total time spent by all reduce tasks (ms)=61138
Total vcore-seconds taken by all map tasks=14596
Total vcore-seconds taken by all reduce tasks=61138
Total megabyte-seconds taken by all map tasks=44838912
Total megabyte-seconds taken by all reduce tasks=313026560
Map-Reduce Framework
Map input records=14345
Map output records=14342
Map output bytes=1261921
Map output materialized bytes=137124
Input split bytes=937
Combine input records=0
Combine output records=0
Reduce input groups=14342
Reduce shuffle bytes=137124
Reduce input records=14342
Reduce output records=14342
Spilled Records=28684
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1299
CPU time spent (ms)=39600
Physical memory (bytes) snapshot=2060779520
Virtual memory (bytes) snapshot=15215738880
Total committed heap usage (bytes)=1864892416
Generator
GENERATE_MARK=14342
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2016-01-08 14:14:38,429 INFO [main] crawl.GeneratorJob: GeneratorJob: finished at 2016-01-08 14:14:38, time elapsed: 00:01:47
2016-01-08 14:14:38,431 INFO [main] crawl.GeneratorJob: GeneratorJob: generated batch id: 1452242570-1295749106 containing 14342 URLs
Fetch job:
../nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1452242566-14060 -crawlId 1 -threads 50
2016-01-08 14:14:43,142 INFO [main] fetcher.FetcherJob: FetcherJob: starting at 2016-01-08 14:14:43
2016-01-08 14:14:43,145 INFO [main] fetcher.FetcherJob: FetcherJob: batchId: 1452242566-14060
2016-01-08 14:15:53,837 INFO [main] mapreduce.Job: Job job_1452239500353_0024 completed successfully
2016-01-08 14:15:54,286 INFO [main] mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=44
FILE: Number of bytes written=349279
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1087
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30528
Total time spent by all reduces in occupied slots (ms)=136535
Total time spent by all map tasks (ms)=10176
Total time spent by all reduce tasks (ms)=27307
Total vcore-seconds taken by all map tasks=10176
Total vcore-seconds taken by all reduce tasks=27307
Total megabyte-seconds taken by all map tasks=31260672
Total megabyte-seconds taken by all reduce tasks=139811840
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=28
Input split bytes=1087
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=28
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=426
CPU time spent (ms)=11140
Physical memory (bytes) snapshot=1884893184
Virtual memory (bytes) snapshot=15245959168
Total committed heap usage (bytes)=1751646208
FetcherStatus
HitByTimeLimit-QueueFeeder=0
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2016-01-08 14:15:54,314 INFO [main] fetcher.FetcherJob: FetcherJob: finished at 2016-01-08 14:15:54, time elapsed: 00:01:11
Please advise.
Upvotes: 0
Views: 249
Finally, after several hours of R&D, I found that the problem was caused by a bug in Nutch: "The batch id passed to GeneratorJob by option/argument -batchId <id>
is ignored and a generated batch id is used to mark the current batch." It is tracked here: https://issues.apache.org/jira/browse/NUTCH-2143
Special thanks to andrew-butkus :)
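Until that bug is fixed, one workaround (a sketch, not official Nutch tooling) is to stop passing -batchId to generate and instead capture the batch id that GeneratorJob actually reports in its log, then hand exactly that id to fetch; alternatively, Nutch 2.x's fetch accepts -all in place of a batch id. Extracting the id from the log line format shown in the generate output above:

```shell
# Parse the batch id out of the GeneratorJob log line (format copied from
# the generate output above); the hard-coded line stands in for reading the log.
log_line='2016-01-08 14:14:38,431 INFO [main] crawl.GeneratorJob: GeneratorJob: generated batch id: 1452242570-1295749106 containing 14342 URLs'
batch_id=$(printf '%s\n' "$log_line" | sed -n 's/.*generated batch id: \([0-9][0-9-]*\) containing.*/\1/p')
echo "$batch_id"    # 1452242570-1295749106
# Then fetch exactly that batch (illustrative, with the crawl id from the question):
#   bin/nutch fetch "$batch_id" -crawlId 1 -threads 50
```

In the question's logs, generate produced batch id 1452242570-1295749106 while fetch was run with 1452242566-14060, which matches the symptom of the bug: no rows carry the batch id fetch was looking for, so the fetch map reads 0 records.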
Upvotes: 0
It's been a while since I worked with Nutch, but from memory there is a time-to-live on fetching a page. For instance, if you crawl http://helloworld.com today and then issue the fetch command again the same day, it will probably just finish without fetching anything, because the time-to-live on the URL http://helloworld.com has not yet expired (I forget the default value).
I think you can fix this by clearing the crawl db and trying again - or there may now be a way to set the time-to-live to 0.
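If this time-to-live theory applies, the knob to turn is the re-fetch interval. A hedged sketch for nutch-site.xml: db.fetch.interval.default is the standard Nutch property for this (the shipped default in nutch-default.xml is 2592000 seconds, i.e. 30 days), and lowering it makes previously fetched URLs become due again sooner:

```xml
<!-- nutch-site.xml override: shrink the re-fetch interval so URLs
     fetched earlier become due again quickly (value is in seconds). -->
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
</property>
```

As for clearing the crawl db: in Nutch 2.x backed by gora-hbase, the crawl data lives in an HBase table (named, if I recall correctly, after the crawl id, e.g. 1_webpage here), so truncating that table from the HBase shell should have the same effect.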
Upvotes: 1