Reputation: 841
I am working with NiFi (was recently turned onto it and it seems to suit my needs). We have recently stood up a Spark/Hadoop cluster and have had Elasticsearch in place for about 2 years now. My goal is to get specific indexes from Elasticsearch into HDFS (syslog, specifically). I am doing a machine learning project for anomaly detection, but want to work on the data from HDFS to speed things up.
So, a little background: our syslog indexes roll over daily (logstash-syslog-2017-11-20, and so forth). I only need the messages from the syslog, so essentially what I want to do is:
ES -> NiFi -> Parse JSON to give me back text -> write each message to its own line in a text file.
In the end, in my HDFS I'd have text files of messages for each index (day), such as:
syslog-2017-11-19
syslog-2017-11-20
syslog-2017-11-21
and so forth....
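To make the "parse JSON to give me back text" step concrete, here is a minimal Python sketch of what that extraction does, outside of NiFi. The `message` field name and the sample hit structure are assumptions based on a typical Logstash syslog document; your `_source` layout may differ:

```python
import json

def extract_messages(es_response_json):
    """Pull the syslog `message` field out of an Elasticsearch
    search response, returning one message per list entry."""
    response = json.loads(es_response_json)
    messages = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        if "message" in source:
            messages.append(source["message"])
    return messages

# Example: a minimal response shaped like an ES search result.
sample = json.dumps({
    "hits": {"hits": [
        {"_source": {"message": "sshd[1023]: Accepted publickey for root"}},
        {"_source": {"message": "kernel: eth0 link up"}},
    ]}
})

# Each extracted message would become one line in the daily HDFS text file.
print("\n".join(extract_messages(sample)))
```

In NiFi terms this is roughly what an EvaluateJsonPath or JoltTransformJSON step would do per flowfile before the lines are appended to the day's file.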
I am stumped on a couple things:
What are the components needed to build this? I see there is a GenerateFlowFile processor, which I think I need in order to make the index names dynamic.
Since I want to pull an entire index, I think I need to use ScrollElasticsearchHttp, but I am not sure. There are other options, but I don't know which is best. When using PySpark, I have done simple queries with the ES-Hadoop connector to grab whole indexes, but I had to increase the scroll size to 10k to make it run faster. I'm just confused about which processor I should be using.
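For context, a scroll-based export is just a loop: request a page of hits, get back a scroll ID, ask for the next page with that ID, and stop when a page comes back empty. The Python sketch below shows that loop with the fetch function injected, so no live cluster is involved; a real client (or NiFi's ScrollElasticsearchHttp) would replace `fetch_page` with HTTP calls to `/_search?scroll=...` and `/_search/scroll`:

```python
def scroll_index(fetch_page):
    """Drain an entire index via scroll-style pagination.

    `fetch_page(scroll_id)` stands in for the HTTP calls a real client
    would make: it returns (next_scroll_id, list_of_hits), with an
    empty list once the index is exhausted.
    """
    all_hits = []
    scroll_id = None  # first call starts a new scroll context
    while True:
        scroll_id, hits = fetch_page(scroll_id)
        if not hits:
            break  # empty page means the scroll is exhausted
        all_hits.extend(hits)
    return all_hits

# Fake backend: 25 documents served in pages of 10, using the
# scroll ID to carry the read offset between calls.
docs = [f"msg-{i}" for i in range(25)]

def fake_fetch(scroll_id):
    offset = 0 if scroll_id is None else int(scroll_id)
    return str(offset + 10), docs[offset:offset + 10]

print(len(scroll_index(fake_fetch)))  # prints 25: every document retrieved
```

The 10k figure mentioned above corresponds to the page size per fetch; a larger page means fewer round trips for the same total document count.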
If someone can give me an idea of the structure of this (which processors, connections, etc. I need) to get the syslog messages from ES into my HDFS, that would be great. I'm still learning this, so please excuse my ignorance. Thanks so much for any help!
Upvotes: 1
Views: 3336
Reputation: 841
I am posting my first comment as the answer, as that is what ended up being my solution.
I ended up using the ScrollElasticsearchHttp processor; as per my comment above, it turned out I had some options formatted incorrectly. Once I got the format right, it worked. I wish the NiFi docs had more explicit examples that show the expected format and explain how it differs from the ES-Hadoop option format. Anyway, things are working now. I'd be interested in looking into writing my own processors - is there a guide out there for this?
Upvotes: 0
Reputation: 321
There is also the ListenBeats processor. You can redirect Logstash to NiFi, and NiFi could write to both ES and HDFS. It's true that this would put NiFi on your critical path.
There is also the possibility to write your own processors, and you can do that very easily. Just follow this article.
I also found NiFi recently and I think it's great. I've only played with it a bit, so I am no expert.
Upvotes: 1