Reputation: 4055
I want to search through a lot of logs (about 1 TB in size, placed on multiple machines) efficiently.
For that purpose, I want to build an infrastructure composed of Flume, Hadoop and Solr. Flume will collect the logs from a couple of machines and put them into HDFS.
Now I want to index those logs with a MapReduce job so that I can search through them with Solr. I found that MapReduceIndexerTool does this for me, but I see that it needs a morphline.
I know that, in general, a morphline performs a set of operations on the data it receives, but what kind of operations should I perform if I want to use the MapReduceIndexerTool?
I can't find any example of a morphline adapted for this MapReduce job.
Thank you.
Upvotes: 1
Views: 317
Reputation: 4544
In general, in a morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.
For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      # parse each Avro record out of the Avro container file
      {
        readAvroContainer {}
      }

      # map Avro paths to Solr document fields
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }

      # drop any fields that are not defined in the Solr schema
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # load the resulting documents into Solr
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.
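Note that the field1_s / field2_s naming only works if the Solr schema defines matching fields. With the stock example schema this is usually covered by a dynamic field rule along these lines (illustrative only, check your own schema.xml):

<!-- schema.xml: any field ending in "_s" (e.g. field1_s, field2_s
     produced by the morphline) is indexed and stored as a string -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>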
This is the command I'm using to index files and merge them into the running collection:
sudo -u hdfs hadoop --config /etc/hadoop/conf \
jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file /local/path/morphlines_file \
--output-dir hdfs://localhost/mrit/out \
--zk-host localhost:2181/solr \
--collection collection1 \
--go-live \
hdfs:/mrit/in/my-avro-file.avro
Solr should be configured to work with HDFS, and the collection should exist.
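For completeness, here is a rough sketch of how such a collection can be created on a CDH cluster with solrctl; the config directory path and shard count are just placeholders for your own values:

# generate a local config template, then edit schema.xml to add your fields
solrctl instancedir --generate $HOME/collection1_config
# upload the config to ZooKeeper under the name "collection1"
solrctl instancedir --create collection1 $HOME/collection1_config
# create the collection itself (1 shard here, adjust as needed)
solrctl collection --create collection1 -s 1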
All this setup works for me with Solr 4.10 on CDH 5.7 Hadoop.
Upvotes: 1
Reputation: 5547
Cloudera has a guide that covers an almost identical use case under its morphlines documentation.
In the example from that guide, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw or semi-structured data is transformed into structured data according to the application's modelling requirements.
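For reference, a morphline implementing that pipeline would look roughly like the following. The grok expression and dictionary path are adapted from the Kite Morphlines syslog example and are meant as a starting point rather than a drop-in config:

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      # read each raw log line into the "message" field
      { readLine { charset : UTF-8 } }

      # parse the syslog line into structured fields via regex patterns
      {
        grok {
          dictionaryFiles : [target/test-classes/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }

      # load the structured record into Solr / SolrCloud
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]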
This use case is what production tools like MapReduceIndexerTool, Apache Flume Morphline Solr Sink, Apache Flume MorphlineInterceptor, and the Morphline Lily HBase Indexer run as part of their operation.
Upvotes: 1