Reputation: 8670
I have set up Apache Nutch 1.18 to crawl the web. For ranking, I am using the scoring-depth filter. By default, the maximum depth is set to 1000 (for each page crawled). Now I need to change this value (to increase it, for example). For this purpose I have updated the following property in Nutch:
<property>
<name>scoring.depth.max</name>
<value>1500</value>
</property>
What happens in Nutch is that the _maxdepth_ metadata field of already crawled documents is not updated. I expected this value to change, so that the crawler would crawl further pages at lower depth (when a URL is selected for fetch).
Briefly, how can I update the _maxdepth_ field in crawled documents in Nutch?
Below is a picture of today's example, where the max depth was set to 2 and I later changed it to 4. I have also noticed that lastModifiedField is set to 0 (I think it should either not change or, if it is updated, be a timestamp).
Upvotes: 1
Views: 171
Reputation: 2239
how can I update maxdepth field in crawled documents in Nutch?
There is no out-of-the-box solution for this. The _maxdepth_ field can, however, also be set from the seed list by adding seeds like
https://example.com/ \t _maxdepth_=3
But yes, it might be an improvement to track the maxdepth only for pages first found from a seed that has a specific maxdepth set. If so, please report it here.
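To make the seed-line layout above concrete, here is a minimal Java sketch of how a line of the form "URL, tab, key=value" splits into a URL and its metadata. It is not Nutch's Injector code: the class `SeedLineSketch`, its `parse` method, and the `"url"` map key are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Nutch: it only illustrates the seed-line layout.
class SeedLineSketch {

    // Splits "url<TAB>key=value<TAB>key=value..." into a map.
    // The URL is stored under the made-up key "url"; the rest are metadata pairs.
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        String[] parts = line.split("\t");
        result.put("url", parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            String field = parts[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // skip fields that have no key before '='
            }
            result.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // A real tab character separates the URL from the metadata.
        Map<String, String> seed = parse("https://example.com/\t_maxdepth_=3");
        System.out.println(seed.get("url") + " has _maxdepth_ " + seed.get("_maxdepth_"));
    }
}
```

The point of the sketch is that the separator must be a real tab character in the seed file, and that the metadata key is literally `_maxdepth_`, underscores included.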
Modified time: Tue Aug 02 ...
lastModifiedField:0
The value in the ProtocolStatus (_pst_) metadata may or may not be set, depending on the protocol implementation used to fetch a page. The "modified time" is a field of the CrawlDatum object and is always reliably set.
Upvotes: 1