Apache Nutch 2.3.1 checkingpointing not working

Question

I have configured apache Nutch 2.3.1 with single node cluster (Hadoop 2.7.x and hbase 1.2.6). I have to check its checkpointing feature. according to my information, resuming is available in Fetch and parse. I assume that at any stage during fetching (or parsing), my complete cluster goes down due to some problem e.g. power failure. I assume that when I restart cluster and crawler with -resume flag, it should start to fetch only those URLs that was not fetched.

But what I observed is that (with debug enabled) that it start to refetch all URLs (with same batchID) till end even with resume flag. Resume flag only works when a job (e.g. fetch) was completely finished. I have cross checked it from its logs with a message like "Skipping express.pk; already fetched".

Does my interprestation is no correct about resume option in Nutch?

or there is some problem in cluster/configuration ?

Apache Nutch 2.3.1 checkingpointing not working

Answers (1)

Related Questions