Reputation: 62
I used Nutch and Elasticsearch to crawl and parse 99 websites/links and index them in Elasticsearch so that I can search them. It crawled all 99 websites/links, but the final output is as follows. I am trying to understand what "redirects" and "add/update" mean. Is it possible to find out which URLs are gone and which are redirects?
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 5 deleted (gone)
Indexer: 8 deleted (redirects)
Indexer: 76 indexed (add/update)
Indexer: finished at 2020-12-17 13:07:19, elapsed: 00:00:08
Upvotes: 1
Views: 80
Reputation: 62
"Gone" means that the website or document is no longer available. This can occur if the website or document has been deleted or if the URL has changed.
"Redirects" means that the website or document has been moved to a new URL. When a website or document is redirected, the old URL will automatically redirect to the new URL. This is often done to update the URL of a website or document or to consolidate multiple URLs into one.
The "add/update" status means that the website or document has been successfully indexed and either added as a new entry in the Elasticsearch index or updated if it already exists.
To find out which websites or documents were deleted or redirected, you can check the logs or try accessing the URLs of the websites or documents to see if they are still available or if they redirect to a new URL. You can also check the Elasticsearch index to see if the websites or documents are still present.
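As a rough illustration of the "try accessing the URLs" suggestion, here is a minimal Python sketch that requests a URL without following redirects and labels the result. The function names and the labels ("ok", "redirect", "gone", "other", "unreachable") are made up for this example and are not Nutch's own categories; Nutch's "gone" count can also include other kinds of failed fetches.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 3xx status stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not follow"; urllib then raises HTTPError.
        return None


def classify_status(code):
    """Map an HTTP status code to a rough label (labels are hypothetical)."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if code in (404, 410):
        return "gone"
    return "other"


def check_url(url, timeout=10):
    """Fetch one URL without following redirects and classify the response."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses all arrive here with their status code.
        return classify_status(err.code)
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout, and similar problems.
        return "unreachable"
```

Calling `check_url(...)` on each of the 99 seed URLs would give a quick overview, although the result reflects the state of the site now, not at crawl time.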
Upvotes: 0
Reputation: 2239
Nutch does not know whether a page is already in the index. To keep the index and the crawled content in sync, when the indexer is run with -deleteGone, 404s and otherwise failed fetches are deleted from the index and counted as "gone".
"And if it is possible to find out which are gone and redirects?"
You can use the Nutch tools readdb to dump the CrawlDb and readseg to dump the segment which was indexed, and then search for 404s, fetch failures, redirects, etc. Calling bin/nutch readdb or bin/nutch readseg will show you all available command-line options.
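Once the CrawlDb has been dumped as plain text (for example with bin/nutch readdb <crawldb> -dump <out_dir>), the "search for 404s, redirects, etc." step could be sketched in Python as below. This assumes the default text dump layout, where each record starts with the URL, a tab, and "Version: ...", followed by a line such as "Status: 3 (db_gone)". The function name and the sample data are hypothetical, so check the layout against your own dump first.

```python
import re
from collections import defaultdict

# CrawlDb status names for gone and redirected pages.
STATUSES_OF_INTEREST = {"db_gone", "db_redir_temp", "db_redir_perm"}

# Matches lines such as "Status: 3 (db_gone)".
STATUS_LINE = re.compile(r"^Status:\s+\d+\s+\((\w+)\)")


def urls_by_status(dump_text):
    """Group URLs from a plain-text CrawlDb dump by selected status names.

    Assumes each record starts with '<url><TAB>Version: <n>' and is
    followed by a 'Status: <code> (<name>)' line.
    """
    result = defaultdict(list)
    current_url = None
    for line in dump_text.splitlines():
        if "\tVersion:" in line:
            # First line of a record: remember the URL before the tab.
            current_url = line.split("\t", 1)[0]
            continue
        match = STATUS_LINE.match(line)
        if match and current_url and match.group(1) in STATUSES_OF_INTEREST:
            result[match.group(1)].append(current_url)
    return dict(result)


if __name__ == "__main__":
    # Made-up sample in the assumed dump layout.
    sample = (
        "http://a.example/\tVersion: 7\n"
        "Status: 3 (db_gone)\n"
        "\n"
        "http://b.example/\tVersion: 7\n"
        "Status: 5 (db_redir_perm)\n"
    )
    print(urls_by_status(sample))
```

To run it on real data, read the text file(s) from the dump directory and pass their contents to `urls_by_status`. The same idea applies to a readseg dump, though its record layout is different and the parsing would need to be adapted.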
Upvotes: 1