Reputation: 5
With StormCrawler 2.3-SNAPSHOT, setting "maxDepth": 0 in urlfilters.json prevents the seeds from being injected into the ES index. Is that the expected behaviour? Or should it inject the seeds and do a closed crawl on the injected seeds only, with no redirection at all? (That is what I was expecting.)
The launch looks fine, but the ES status index is empty.
Upvotes: 0
Views: 37
Reputation: 4864
See MaxDepthFilter: with a value of 0, everything gets filtered. Setting the filter to a value of 1 should do the trick: the seeds will be injected but their links won't be followed.
In MaxDepthFilter, this method

    private String filter(final int depth, final int max, final String url) {
        // deactivate the outlink no matter what the depth is
        if (max == 0) {
            return null;
        }
        if (depth >= max) {
            LOG.debug("filtered out {} - depth {} >= {}", url, depth, maxDepth);
            return null;
        }
        return url;
    }

means that URLs need a depth of at most max-1 to be kept. To put it differently, the actual maximum depth is max-1.
This does not feel right and is slightly confusing, I agree.
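To make the max-1 behaviour concrete, here is a minimal sketch that copies the comparison logic of the method above into a standalone class and prints the verdict for a few depth/max combinations. The class name MaxDepthDemo and the example URL are hypothetical, the logging call is dropped, and the depth values are passed in by hand: the sketch does not reproduce how StormCrawler computes the depth it hands to the filter.

```java
// Standalone copy of the filter logic quoted above, with logging removed.
// MaxDepthDemo is a hypothetical class name used only for this illustration.
class MaxDepthDemo {

    // Returns the URL if it is kept, or null if it is filtered out.
    static String filter(final int depth, final int max, final String url) {
        // a max of 0 rejects everything, whatever the depth
        if (max == 0) {
            return null;
        }
        // URLs are kept only while depth < max, i.e. depth <= max - 1
        if (depth >= max) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        String url = "https://example.com/"; // placeholder URL
        for (int max = 0; max <= 2; max++) {
            for (int depth = 0; depth <= 2; depth++) {
                String verdict = filter(depth, max, url) == null ? "filtered" : "kept";
                System.out.println("max=" + max + " depth=" + depth + " -> " + verdict);
            }
        }
    }
}
```

With max set to 1 only a depth of 0 passes, and with max set to 2 depths 0 and 1 pass, which is the max-1 limit described above.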
I think this is due to the order in which the outlinks get processed. The filtering is often done in the StatusEmitterBolt.
At the moment the outlinks first get filtered and then inherit their metadata from the parent metadata. It is during that latter step that their depth value gets incremented. I suspect this is why we are doing the max-1 trick.
There probably was a reason why the filtering was done before the metadata inheritance, but it has been a while and I can't remember it. I would be happy to change the order, so that the metadata is inherited first and the filtering happens afterwards, and to change the depth filtering so that it is more intuitive. Could you please open an issue on GitHub so that we can discuss it there?
Thanks!
Upvotes: 0