Hafiz Muhammad Shafiq

Reputation: 8670

Updating max depth for the Apache Nutch crawler's scoring-depth filter is not working

I have set up Apache Nutch 1.18 to crawl the web. For ranking, I am using the scoring-depth filter. By default, the maximum depth is set to 1000 (and recorded in each crawled page's metadata). Now I need to change this value (increase it, for example). I have updated the following property in Nutch for this purpose:

<property>
  <name>scoring.depth.max</name>
  <value>1500</value>
</property> 
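For completeness: the scoring-depth plugin must also be listed in the plugin.includes property for scoring.depth.max to have any effect. The value below is only an illustrative plugin list, not necessarily the exact one from my setup:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
</property>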

What is happening now is that the _maxdepth_ metadata field of already crawled documents is not updated. What I expect is that this value gets changed so that the crawler follows further pages at greater depth (when a URL is selected for fetching).

Briefly, how can I update the _maxdepth_ field in already crawled documents in Nutch?

Below is a picture from today's example, where max depth was set to 2 and later changed to 4. I have also observed that lastModifiedField is set to 0 (I think it should either not change or, if updated, be set to a timestamp).

[screenshot of the CrawlDb record]
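For reference, the per-URL CrawlDb record (including the _depth_ and _maxdepth_ metadata and the modified time) can be inspected with the readdb tool; the crawldb path and URL below are only examples:

bin/nutch readdb crawl/crawldb -url https://example.com/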

Upvotes: 1

Views: 171

Answers (1)

Sebastian Nagel

Reputation: 2239

how can I update the _maxdepth_ field in crawled documents in Nutch?

There is no out-of-the-box solution for this. However, the _maxdepth_ field may also be set from the seed list by adding seeds like

https://example.com/ \t _maxdepth_=3

But yes, it might be an improvement to track the maxdepth only for pages first found from a seed with a specific maxdepth set. If you think so, please report it here.
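For illustration, a seed file using this per-seed metadata, together with the corresponding inject call, could look like the following (file layout, URLs and depth values are only examples; the metadata is separated from the URL by a tab character):

# urls/seed.txt
https://example.com/ \t _maxdepth_=3
https://example.org/ \t _maxdepth_=5

bin/nutch inject crawl/crawldb urls/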

Modified time: Tue Aug 02 ...

lastModifiedField:0

The value in the ProtocolStatus (_pst_) metadata may or may not be set, depending on the protocol implementation used to fetch a page. The "modified time", on the other hand, is a field of the CrawlDatum object and is always set and reliable.

Upvotes: 1
