Cosmin Ioniță

Reputation: 4055

What should a morphline for MapReduceIndexerTool look like?

I want to search through a lot of logs (about 1 TB in size, placed on multiple machines) efficiently.

For that purpose, I want to build an infrastructure composed of Flume, Hadoop and Solr. Flume will get the logs from a couple of machines and will put them into HDFS.

Now I want to index those logs with a MapReduce job so that I can search them with Solr. I found that MapReduceIndexerTool does this, but I see that it needs a morphline.

I know that a morphline, in general, performs a set of operations on the data it reads, but what kind of operations should I perform if I want to use the MapReduceIndexerTool?

I can't find any example of a morphline adapted to this MapReduce job.

Thank you respectfully.

Upvotes: 1

Views: 317

Answers (2)

arghtype

Reputation: 4544

In general, in the morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.

For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:

SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]    
    commands : [
      # Parse each input file as an Avro container and emit one record per Avro datum
      {
        readAvroContainer {}
      }
      # Map Avro paths onto Solr field names (dynamic *_s string fields here)
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }
      # Drop any fields that are not defined in the Solr schema
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      # Load the resulting documents into Solr
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.
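
For context, the extractAvroPaths step above assumes the input Avro records contain matching fields. A hypothetical schema for such a file could look like this (the field names come from the morphline; the record name and types are illustrative):

{
  "type" : "record",
  "name" : "MyRecord",
  "fields" : [
    { "name" : "id",     "type" : "string" },
    { "name" : "field1", "type" : "string" },
    { "name" : "field2", "type" : "string" }
  ]
}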

This is the command I use to index files and merge them into the running collection:

sudo -u hdfs hadoop --config /etc/hadoop/conf \
jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file /local/path/morphlines_file  \
--output-dir hdfs://localhost/mrit/out \
--zk-host localhost:2181/solr \
--collection collection1 \
--go-live \
hdfs:/mrit/in/my-avro-file.avro

Solr should be configured to work with HDFS, and the collection should already exist.
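
On a CDH deployment, the collection can be created beforehand with solrctl; this is only a sketch with placeholder config directory, collection name, and shard count:

# Generate a config template, upload it to ZooKeeper, and create the collection
solrctl instancedir --generate $HOME/collection1_config
solrctl instancedir --create collection1 $HOME/collection1_config
solrctl collection --create collection1 -s 1

Whether the index itself lives on HDFS is controlled by the directoryFactory settings in solrconfig.xml; on CDH this is typically handled for you by Cloudera Manager.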

All of this setup works for me with Solr 4.10 on CDH 5.7.

Upvotes: 1

Gyanendra Dwivedi

Reputation: 5547

Cloudera has a guide that describes an almost identical use case in its morphlines documentation.

[Figure: syslog events flowing from a Flume source through a Morphline Sink (readLine → grok → loadSolr) into Solr]

In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.
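
As a rough sketch of what the question asks for, a morphline for plain log files would follow the same readLine → grok → loadSolr chain described above. This is illustrative only: the zkHost, collection name, grok dictionary path, and syslog pattern are assumptions to adapt to your own environment and log format.

SOLR_LOCATOR : {
  # assumed collection name and ZooKeeper ensemble
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Emit one record per log line
      { readLine { charset : UTF-8 } }

      # Parse each line with a grok pattern (syslog shown here as an example)
      {
        grok {
          # placeholder path to the grok pattern dictionaries
          dictionaryFiles : [/path/to/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }

      # Give every document a unique key for Solr
      { generateUUID { field : id } }

      # Drop fields the Solr schema does not know about, then index
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]

The field names produced by grok (syslog_timestamp, syslog_hostname, and so on) would need matching fields or dynamic field rules in the Solr schema.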

This is the same use case that production tools such as MapReduceIndexerTool, Apache Flume Morphline Solr Sink, Apache Flume MorphlineInterceptor, and the Morphline Lily HBase Indexer run as part of their operation, as outlined in the following figure:

[Figure: morphlines embedded in MapReduceIndexerTool, Flume Morphline Solr Sink, Flume MorphlineInterceptor, and the Lily HBase Indexer]

Upvotes: 1
