Reputation: 366
I need to crawl two websites and index them into Elasticsearch as two different indexes or types. I am using Nutch 1.15 with Elasticsearch 5.3.3.
How can we crawl two different sites and index them separately in Elasticsearch with Nutch? Can this be achieved in a single instance of Nutch?
Upvotes: 0
Views: 173
Reputation: 3298
At the moment there is nothing in Nutch to do document routing. For instance, if you use the index-jexl-filter, the filtering is done before the document is sent to the Nutch index writers. You can configure multiple index writers (two, in your case), and every document will then be sent to all of them. These writers could be writing to different indexes/document types, but all documents will end up in both indexes/document types.
That being said, if you find a way to do the filtering on the ES side, you could configure those index writers and route the documents to both of them, then filter in ES at ingestion time (perhaps with something like a script in ES that prevents a document from being ingested if it doesn't meet a certain requirement). But off the top of my head, I cannot pinpoint anything specific that does this right now.
Also, you can just clone the elastic indexer and customise it so that the type is extracted from the document itself.
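To make the idea of deriving the target from the document itself more concrete, here is a minimal sketch of the decision logic only, in plain Java. It is not Nutch or Elasticsearch code: the class name IndexNameResolver, the hosts site-a.example and site-b.example, and the index names are all made up for illustration. In a customised copy of the indexer you would call something like this with the document's URL and use the result as the index (or type) name.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: picks a target index from a document's URL.
class IndexNameResolver {
    // Assumed host-to-index mapping; replace with the two sites being crawled.
    private static final Map<String, String> HOST_TO_INDEX = Map.of(
            "site-a.example", "site_a",
            "site-b.example", "site_b");

    // Assumed fallback for documents from any other host.
    private static final String DEFAULT_INDEX = "nutch";

    static String indexFor(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return DEFAULT_INDEX; // malformed URL
        }
        if (host == null) {
            return DEFAULT_INDEX; // URL without a host part
        }
        // Host names are case-insensitive, so normalise before the lookup.
        return HOST_TO_INDEX.getOrDefault(host.toLowerCase(Locale.ROOT), DEFAULT_INDEX);
    }

    public static void main(String[] args) {
        System.out.println(indexFor("https://site-a.example/page.html"));
        System.out.println(indexFor("https://site-b.example/docs/"));
        System.out.println(indexFor("https://other.example/"));
    }
}
```

Keeping the mapping in one small function makes it easy to test separately from the indexer, which still has to be wired up by hand in the cloned plugin.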
EDIT
Thanks to @sebastian-nagel for pointing this out.
I totally missed the JexlExchange exchange (https://nutch.apache.org/apidocs/apidocs-1.15/org/apache/nutch/exchange/jexl/JexlExchange.html), which does exactly what you want. With it, it is possible to do document routing at indexing time using a JEXL expression.
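As a rough mental model of this kind of routing, the plain Java sketch below pairs a condition on the document with the writer ids that should receive it. This is not Nutch's API and does not evaluate JEXL: the class ExchangeRoutingSketch, the writer ids es_site_a, es_site_b and es_default, and the fall-back-to-default behaviour are assumptions made for the example. Check the Nutch exchange documentation for the real configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of expression-based routing; not Nutch code.
class ExchangeRoutingSketch {
    // A rule pairs a condition on the document with the writers that receive it.
    static final class Rule {
        final Predicate<Map<String, String>> condition;
        final List<String> writerIds;

        Rule(Predicate<Map<String, String>> condition, List<String> writerIds) {
            this.condition = condition;
            this.writerIds = writerIds;
        }
    }

    // Collects the writers of every matching rule; falls back to the
    // default writers when nothing matches (an assumption of this sketch).
    static List<String> route(Map<String, String> doc, List<Rule> rules,
                              List<String> defaultWriters) {
        List<String> targets = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.condition.test(doc)) {
                targets.addAll(rule.writerIds);
            }
        }
        return targets.isEmpty() ? defaultWriters : targets;
    }

    // Example rules: one writer per site, chosen by the document's host field.
    static List<Rule> exampleRules() {
        return List.of(
                new Rule(doc -> "site-a.example".equals(doc.get("host")), List.of("es_site_a")),
                new Rule(doc -> "site-b.example".equals(doc.get("host")), List.of("es_site_b")));
    }

    public static void main(String[] args) {
        List<String> fallback = List.of("es_default");
        System.out.println(route(Map.of("host", "site-a.example"), exampleRules(), fallback));
        System.out.println(route(Map.of("host", "unknown.example"), exampleRules(), fallback));
    }
}
```

With one index writer configured per target index, each site's documents would then reach only the writer whose condition matches.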
Upvotes: 0