Reputation: 841
I am working with NiFi (was recently turned onto it and it seems to suit my needs). We have recently stood up a Spark/Hadoop cluster and have had Elasticsearch in place for about 2 years now. My goal is to get specific indexes from Elasticsearch into HDFS (syslog, specifically). I am doing a machine learning project for anomaly detection, but want to work on the data from HDFS to speed things up.
So, a little background: our syslog indexes roll over daily (logstash-syslog-2017-11-20, and so forth). I only need the messages from the syslog, so essentially what I want to do is:
ES -> NiFi -> Parse JSON to give me back text -> write each message to its own line in a text file.
In the end, in my HDFS I'd have text files of messages for each index (day), such as:
syslog-2017-11-19
syslog-2017-11-20
syslog-2017-11-21
and so forth....
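To make the "parse JSON to give me back text" step concrete, here is a minimal Python sketch of what that extraction does, outside of NiFi. The `message` field name and the sample hit structure are assumptions based on a typical Logstash syslog document; your `_source` layout may differ:

```python
import json

def extract_messages(es_response_json):
    """Pull the syslog `message` field out of an Elasticsearch
    search response, returning one message per list entry."""
    response = json.loads(es_response_json)
    messages = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        if "message" in source:
            messages.append(source["message"])
    return messages

# Example: a minimal response shaped like an ES search result.
sample = json.dumps({
    "hits": {"hits": [
        {"_source": {"message": "sshd[1023]: Accepted publickey for root"}},
        {"_source": {"message": "kernel: eth0 link up"}},
    ]}
})

# Each extracted message would become one line in the daily HDFS text file.
print("\n".join(extract_messages(sample)))
```

In NiFi terms this is roughly what an EvaluateJsonPath or JoltTransformJSON step would do per flowfile before the lines are appended to the day's file.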
I am stumped on a couple things:
What are the components needed to build this? I see there is a GenerateFlowFile processor, which I think I need in order to make the index names dynamic.
Since I want to pull an entire index, I think I need to use ScrollElasticsearchHttp, but I am not sure. There are other options, but I don't know which is best. When using PySpark, I have done simple queries with the ES-Hadoop connector to grab whole indexes, but I had to increase the scroll size to 10k to make it run faster. I'm just confused about which processor I should be using.
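For context, a scroll-based export is just a loop: request a page of hits, get back a scroll ID, ask for the next page with that ID, and stop when a page comes back empty. The Python sketch below shows that loop with the fetch function injected, so no live cluster is involved; a real client (or NiFi's ScrollElasticsearchHttp) would replace `fetch_page` with HTTP calls to `/_search?scroll=...` and `/_search/scroll`:

```python
def scroll_index(fetch_page):
    """Drain an entire index via scroll-style pagination.

    `fetch_page(scroll_id)` stands in for the HTTP calls a real client
    would make: it returns (next_scroll_id, list_of_hits), with an
    empty list once the index is exhausted.
    """
    all_hits = []
    scroll_id = None  # first call starts a new scroll context
    while True:
        scroll_id, hits = fetch_page(scroll_id)
        if not hits:
            break  # empty page means the scroll is exhausted
        all_hits.extend(hits)
    return all_hits

# Fake backend: 25 documents served in pages of 10, using the
# scroll ID to carry the read offset between calls.
docs = [f"msg-{i}" for i in range(25)]

def fake_fetch(scroll_id):
    offset = 0 if scroll_id is None else int(scroll_id)
    return str(offset + 10), docs[offset:offset + 10]

print(len(scroll_index(fake_fetch)))  # prints 25: every document retrieved
```

The 10k figure mentioned above corresponds to the page size per fetch; a larger page means fewer round trips for the same total document count.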
If someone can give me an idea of the structure of this (which processors, connections, etc. I need) to get the syslog messages from ES into my HDFS, that would be great. I'm still learning this, so please excuse my ignorance. Thanks so much for any help!
Upvotes: 1
Views: 3336
Reputation: 841
I am posting my first comment as the answer, as that is what ended up being my solution.
I ended up using the ScrollElasticsearchHttp processor; as per my comment above, it turned out I had some options formatted incorrectly. Once I got the format right, it worked. I wish the NiFi docs had more explicit examples that show the expected format and explain how it differs from the ES-Hadoop option format. Anyway, things are working now. I'd be interested in looking into writing my own processors - is there a guide out there for this?
Upvotes: 0
Reputation: 321
There is also the ListenBeats processor. You can redirect Logstash to NiFi, and NiFi could write to both ES and HDFS. It's true that this would put NiFi on your critical path.
There is also the possibility to write your own processors, and you can do that very easily. Just follow this article.
I also found NiFi recently and I think it's great. I've only played with it a bit, so I am no expert.
Upvotes: 1