Reputation: 366

nutch with elasticsearch creating multiple index/types

I need to crawl two websites and index them into Elasticsearch as two different indexes or types. I am using Nutch 1.15 with Elasticsearch 5.3.3.

How can I crawl two different sites and index them separately in Elasticsearch with Nutch? Can this be achieved with a single instance of Nutch?

Upvotes: 0

Answers (1)

Jorge Luis

Reputation: 3298

At the moment there is nothing in Nutch to do document routing. For instance, if you use index-jexl-filter, the filtering is done before the document is sent to the index writers. You can configure multiple index writers (two, in your case), and the documents will then be sent to both of them. These writers could write to different indexes/document types, but every document will end up in both indexes/document types.

That being said, if you find a way to do the filtering on the ES side, you could configure those index writers, route the documents to both of them, and then filter in ES at ingestion time (perhaps with something like a script in ES that prevents a document from being ingested if it doesn't match a certain requirement). Off the top of my head, though, I can't point to anything specific that does this right now.

Also, you can just clone the elastic indexer and customise it so that the type is extracted from the document itself.
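
To make the idea concrete, here is a rough sketch in plain Python (not Nutch code) of what "extract the target from the document itself" could look like. The hosts, the index names, the `url` field and the `index_for` helper are all made up for illustration; a customised indexer would do the equivalent in Java inside its write method.

```python
from urllib.parse import urlparse

# Hypothetical mapping from crawled host to target index.
# Both the hosts and the index names are assumptions for this sketch.
HOST_TO_INDEX = {
    "site-a.example": "site_a",
    "site-b.example": "site_b",
}


def index_for(doc, default="nutch"):
    """Pick the target index from the document's own URL.

    Falls back to `default` when the host is unknown or the URL
    has no recognisable host.
    """
    host = urlparse(doc["url"]).hostname
    return HOST_TO_INDEX.get(host, default)


if __name__ == "__main__":
    for doc in (
        {"url": "https://site-a.example/page"},
        {"url": "https://site-b.example/news/1"},
        {"url": "https://elsewhere.example/"},
    ):
        print(doc["url"], "->", index_for(doc))
```

The point is only that the decision is made per document, from a field the crawler already provides, instead of being fixed once in the writer configuration.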

EDIT

Thanks to @sebastian-nagel for pointing this out.

I totally missed the JexlExchange exchange (https://nutch.apache.org/apidocs/apidocs-1.15/org/apache/nutch/exchange/jexl/JexlExchange.html), which does exactly what you want. With it, it is possible to do document routing at indexing time using a JEXL expression.
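
For anyone unsure what "routing at indexing time" means in practice, here is a small conceptual model in plain Python. It is not Nutch's API: the `Exchange` class, the `route` function, the writer IDs and the hosts are all hypothetical, and ordinary Python predicates stand in for the JEXL expression. The fallback to a default set of writers when nothing matches is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Doc = Dict[str, str]


@dataclass
class Exchange:
    """A routing rule: documents that match go to the listed writers."""
    writer_ids: List[str]
    matches: Callable[[Doc], bool]  # stands in for a JEXL expression


def route(doc: Doc, exchanges: List[Exchange], default_writers: List[str]) -> List[str]:
    """Return the IDs of the writers that should receive `doc`."""
    targets: List[str] = []
    for exchange in exchanges:
        if exchange.matches(doc):
            for writer_id in exchange.writer_ids:
                if writer_id not in targets:  # keep order, skip duplicates
                    targets.append(writer_id)
    # Assumption: unmatched documents go to the default writers.
    return targets or list(default_writers)


# Hypothetical setup: one Elasticsearch writer per site.
EXCHANGES = [
    Exchange(["es_site_a"], lambda d: d.get("host") == "site-a.example"),
    Exchange(["es_site_b"], lambda d: d.get("host") == "site-b.example"),
]

if __name__ == "__main__":
    for doc in ({"host": "site-a.example"}, {"host": "other.example"}):
        print(doc["host"], "->", route(doc, EXCHANGES, ["es_default"]))
```

Each writer would then be configured with its own index, so a single Nutch instance can feed both sites into separate indexes.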

Upvotes: 0
