Code wrangler

Reputation: 134

Nutch fetch command not fetching data

I have a cluster set up with the following software stack:

nutch-branch-2.3.1, gora-hbase-0.6.1, hadoop-2.5.2, hbase-0.98.8-hadoop2

The crawl cycle is: inject, generate, fetch, parse, updatedb. The first two (inject and generate) work fine, but the fetch command, even though it executes successfully, does not fetch any data, and because the fetch step produces nothing, all subsequent steps fail as well.
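For context, here is roughly how I run the cycle (a minimal sketch; the seed directory, topN value, and batch id are placeholders):

    bin/nutch inject /seed-urls -crawlId 1              # load seed URLs into the webpage store
    bin/nutch generate -topN 50000 -crawlId 1           # mark a batch of URLs due for fetching
    bin/nutch fetch <batchId> -crawlId 1 -threads 50    # fetch the URLs marked with that batch id
    bin/nutch parse <batchId> -crawlId 1                # parse the fetched content
    bin/nutch updatedb <batchId> -crawlId 1             # write parse results back to the store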

Please find the counter logs for each job:

Inject job:

2016-01-08 14:12:45,649 INFO  [main] mapreduce.Job: Counters: 31
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=114853
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=836443
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=2
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=179217
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=59739
        Total vcore-seconds taken by all map tasks=59739
        Total megabyte-seconds taken by all map tasks=183518208
    Map-Reduce Framework
        Map input records=29973
        Map output records=29973
        Input split bytes=94
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=318
        CPU time spent (ms)=24980
        Physical memory (bytes) snapshot=427704320
        Virtual memory (bytes) snapshot=5077356544
        Total committed heap usage (bytes)=328728576
    injector
        urls_injected=29973
    File Input Format Counters 
        Bytes Read=836349
    File Output Format Counters 
        Bytes Written=0

Generate job:

2016-01-08 14:14:38,257 INFO  [main] mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=137140
        FILE: Number of bytes written=623942
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=937
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=1
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=2
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=43788
        Total time spent by all reduces in occupied slots (ms)=305690
        Total time spent by all map tasks (ms)=14596
        Total time spent by all reduce tasks (ms)=61138
        Total vcore-seconds taken by all map tasks=14596
        Total vcore-seconds taken by all reduce tasks=61138
        Total megabyte-seconds taken by all map tasks=44838912
        Total megabyte-seconds taken by all reduce tasks=313026560
    Map-Reduce Framework
        Map input records=14345
        Map output records=14342
        Map output bytes=1261921
        Map output materialized bytes=137124
        Input split bytes=937
        Combine input records=0
        Combine output records=0
        Reduce input groups=14342
        Reduce shuffle bytes=137124
        Reduce input records=14342
        Reduce output records=14342
        Spilled Records=28684
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=1299
        CPU time spent (ms)=39600
        Physical memory (bytes) snapshot=2060779520
        Virtual memory (bytes) snapshot=15215738880
        Total committed heap usage (bytes)=1864892416
    Generator
        GENERATE_MARK=14342
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
2016-01-08 14:14:38,429 INFO  [main] crawl.GeneratorJob: GeneratorJob: finished at 2016-01-08 14:14:38, time elapsed: 00:01:47
2016-01-08 14:14:38,431 INFO  [main] crawl.GeneratorJob: GeneratorJob: generated batch id: 1452242570-1295749106 containing 14342 URLs

Fetch job:

../nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1452242566-14060 -crawlId 1 -threads 50


2016-01-08 14:14:43,142 INFO  [main] fetcher.FetcherJob: FetcherJob: starting at 2016-01-08 14:14:43
2016-01-08 14:14:43,145 INFO  [main] fetcher.FetcherJob: FetcherJob: batchId: 1452242566-14060
2016-01-08 14:15:53,837 INFO  [main] mapreduce.Job: Job job_1452239500353_0024 completed successfully
2016-01-08 14:15:54,286 INFO  [main] mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=44
        FILE: Number of bytes written=349279
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1087
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=1
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=2
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=30528
        Total time spent by all reduces in occupied slots (ms)=136535
        Total time spent by all map tasks (ms)=10176
        Total time spent by all reduce tasks (ms)=27307
        Total vcore-seconds taken by all map tasks=10176
        Total vcore-seconds taken by all reduce tasks=27307
        Total megabyte-seconds taken by all map tasks=31260672
        Total megabyte-seconds taken by all reduce tasks=139811840
    Map-Reduce Framework
        Map input records=0
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=28
        Input split bytes=1087
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=28
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=426
        CPU time spent (ms)=11140
        Physical memory (bytes) snapshot=1884893184
        Virtual memory (bytes) snapshot=15245959168
        Total committed heap usage (bytes)=1751646208
    FetcherStatus
        HitByTimeLimit-QueueFeeder=0
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
2016-01-08 14:15:54,314 INFO  [main] fetcher.FetcherJob: FetcherJob: finished at 2016-01-08 14:15:54, time elapsed: 00:01:11

Please advise.

Upvotes: 0

Views: 249

Answers (2)

Code wrangler

Reputation: 134

Finally, after several hours of R&D, I found the problem was caused by a bug in Nutch: "The batch id passed to GeneratorJob by option/argument -batchId <id> is ignored and a generated batch id is used to mark the current batch." It is tracked as https://issues.apache.org/jira/browse/NUTCH-2143. You can see the symptom in the logs above: GeneratorJob reports "generated batch id: 1452242570-1295749106", while the fetch command was run with batch id 1452242566-14060, so the fetch job found no rows marked with the id it was looking for (hence Map input records=0).
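Until the fix is in, a workaround is to fetch whatever the generator actually marked, rather than the id you passed in. A minimal sketch (extracting the batch id from GeneratorJob's "generated batch id:" log line is my own convention, not a Nutch feature; -all is the built-in alternative that fetches all pending batches):

    # Option 1: fetch every pending batch instead of a specific id
    bin/nutch fetch -all -crawlId 1 -threads 50

    # Option 2: capture the batch id GeneratorJob reports and reuse it
    BATCH_ID=$(bin/nutch generate -topN 50000 -crawlId 1 2>&1 \
        | grep -o 'generated batch id: [^ ]*' | awk '{print $4}')
    bin/nutch fetch "$BATCH_ID" -crawlId 1 -threads 50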

Special thanks to andrew.butkus :)

Upvotes: 0

andrew.butkus

Reputation: 777

It's been a while since I worked with Nutch, but from memory each page has a time-to-live (re-fetch interval). For instance, if you crawl http://helloworld.com today and then issue the fetch command again the same day, it will probably just finish without fetching anything, because the next fetch of http://helloworld.com is deferred by x days (the default interval, db.fetch.interval.default, is 30 days).

I think you can fix this by clearing the crawl_db and trying again, or there may be a property to set the time-to-live to 0.
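A rough sketch of the "clear and retry" route, assuming the usual Nutch 2.x / Gora-HBase table naming, where -crawlId 1 maps to an HBase table called 1_webpage (confirm the actual name with the list command first):

    # Clear the web table; truncate disables, drops, and recreates it
    echo "truncate '1_webpage'" | hbase shell

    # Re-inject: freshly injected URLs are due for fetching immediately,
    # so the next generate/fetch cycle will pick them up
    bin/nutch inject /seed-urls -crawlId 1

    # To make already-fetched pages come due again sooner on later crawls,
    # lower db.fetch.interval.default (seconds; default 2592000 = 30 days)
    # in conf/nutch-site.xml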

Upvotes: 1
