Cosmin Ioniță

Reputation: 4055

What should a morphline for MapReduceIndexerTool look like?

I want to search through a lot of logs (about 1 TB in size, placed on multiple machines) efficiently.

For that purpose, I want to build an infrastructure composed of Flume, Hadoop and Solr. Flume will get the logs from a couple of machines and will put them into HDFS.

Now I want to index those logs with a MapReduce job so that I can search them with Solr. I found that MapReduceIndexerTool does this, but I see that it needs a morphline.

I know that a morphline, in general, performs a set of operations on the data it reads, but what kind of operations should I perform if I want to use the MapReduceIndexerTool?

I can't find any example of a morphline adapted to this MapReduce job.

Thank you respectfully.

Upvotes: 1

Views: 317

Answers (2)

arghtype

Reputation: 4544

In general, in the morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.

For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:

SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]    
    commands : [
      # Parse each input file as an Avro container and emit one record per Avro datum
      {
        readAvroContainer {}
      }
      # Map Avro paths onto Solr field names (dynamic *_s string fields here)
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }
      # Drop any fields that are not defined in the Solr schema
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      # Load the resulting documents into Solr
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.
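
For context, the extractAvroPaths step above assumes the input Avro records contain matching fields. A hypothetical schema for such a file could look like this (the field names come from the morphline; the record name and types are illustrative):

{
  "type" : "record",
  "name" : "MyRecord",
  "fields" : [
    { "name" : "id",     "type" : "string" },
    { "name" : "field1", "type" : "string" },
    { "name" : "field2", "type" : "string" }
  ]
}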

This is the command I use to index files and merge them into the running collection:

sudo -u hdfs hadoop --config /etc/hadoop/conf \
jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file /local/path/morphlines_file  \
--output-dir hdfs://localhost/mrit/out \
--zk-host localhost:2181/solr \
--collection collection1 \
--go-live \
hdfs:/mrit/in/my-avro-file.avro

Solr should be configured to work with HDFS, and the collection should already exist.
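
On a CDH deployment, the collection can be created beforehand with solrctl; this is only a sketch with placeholder config directory, collection name, and shard count:

# Generate a config template, upload it to ZooKeeper, and create the collection
solrctl instancedir --generate $HOME/collection1_config
solrctl instancedir --create collection1 $HOME/collection1_config
solrctl collection --create collection1 -s 1

Whether the index itself lives on HDFS is controlled by the directoryFactory settings in solrconfig.xml; on CDH this is typically handled for you by Cloudera Manager.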

All of this setup works for me with Solr 4.10 on CDH 5.7.

Upvotes: 1

Gyanendra Dwivedi

Reputation: 5547

Cloudera has a guide that describes an almost identical use case in its morphlines documentation.

[Figure: syslog events flowing from a Flume source through a Morphline Sink (readLine → grok → loadSolr) into Solr]

In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.
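
As a rough sketch of what the question asks for, a morphline for plain log files would follow the same readLine → grok → loadSolr chain described above. This is illustrative only: the zkHost, collection name, grok dictionary path, and syslog pattern are assumptions to adapt to your own environment and log format.

SOLR_LOCATOR : {
  # assumed collection name and ZooKeeper ensemble
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Emit one record per log line
      { readLine { charset : UTF-8 } }

      # Parse each line with a grok pattern (syslog shown here as an example)
      {
        grok {
          # placeholder path to the grok pattern dictionaries
          dictionaryFiles : [/path/to/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }

      # Give every document a unique key for Solr
      { generateUUID { field : id } }

      # Drop fields the Solr schema does not know about, then index
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]

The field names produced by grok (syslog_timestamp, syslog_hostname, and so on) would need matching fields or dynamic field rules in the Solr schema.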

This is the same use case that production tools such as MapReduceIndexerTool, Apache Flume Morphline Solr Sink, Apache Flume MorphlineInterceptor, and the Morphline Lily HBase Indexer run as part of their operation, as outlined in the following figure:

[Figure: morphlines embedded in MapReduceIndexerTool, Flume Morphline Solr Sink, Flume MorphlineInterceptor, and the Lily HBase Indexer]

Upvotes: 1
