prerna Keshari

Reputation: 462

Indexing log files using Solr

I have to index log files which are sitting in a recursive directory structure (each directory can have one or more files and directories). Log files have all kind of different extensions. Search will be based on text of log file. All the files containing a particular string (search keyword) should come out along with its name and full path as result of search.

I tried to use DIH with the TikaEntityProcessor for this, but it seems to work only for a single file. I also tried FileListEntityProcessor, but couldn't get it working.

How can I index these log files using Solr? Please help if someone has done the same.

Thanks in advance.

P.S. Individual log files are not very large but overall data is huge.

Upvotes: 1

Views: 2040

Answers (2)

prerna Keshari

Reputation: 462

TikaEntityProcessor can be used with FileListEntityProcessor.

data-config.xml

<dataConfig>
    <dataSource name="bin" type="BinFileDataSource"/>
    <document>
        <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor" transformer="TemplateTransformer"
            baseDir="L:/Documents/65923/"
            fileName=".*\.*" onError="skip" recursive="true">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastmodified" />

            <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" onError="skip">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>   
            </entity>
        </entity>
    </document>
</dataConfig>

Upvotes: 1

D_K

Reputation: 1420

I would do something like this:

  1. Have one system produce the set of input directories that match your search.
  2. Have a piece of functionality that parses the matched logs (or parts of them) in those directories into in-memory Solr documents.
  3. Stream documents from a directory or set of directories into Solr via an iterator:

    // point the client at the base URL of your Solr core
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    Iterator<SolrInputDocument> iter = new Iterator<SolrInputDocument>() {
        public boolean hasNext() {
            boolean result = false;
            // set result to true/false depending on whether more documents remain
            return result;
        }

        public SolrInputDocument next() {
            SolrInputDocument result = null;
            // construct a new document here and assign it to result
            return result;
        }

        public void remove() {
            // removal makes no sense for a streaming source
            throw new UnsupportedOperationException();
        }
    };
    server.add(iter);
    

See this and other methods here: http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
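The `hasNext()`/`next()` bodies above are left for you to fill in. As a stdlib-only sketch of steps 1 and 2 (no SolrJ dependency; the class and method names here are my own, not from any library), the recursive directory walk that would feed such an iterator might look like this:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class LogWalker {

    /** Recursively collect every regular file under baseDir (step 1). */
    public static List<Path> collectLogFiles(Path baseDir) throws IOException {
        final List<Path> files = new ArrayList<>();
        Files.walkFileTree(baseDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                files.add(file);          // every regular file, regardless of extension
                return FileVisitResult.CONTINUE;
            }
        });
        return files;
    }

    public static void main(String[] args) throws IOException {
        // build a small recursive tree to demonstrate the walk
        Path base = Files.createTempDirectory("logs");
        Path sub = Files.createDirectories(base.resolve("app/node1"));
        Files.write(sub.resolve("server.log"), "ERROR out of memory".getBytes(StandardCharsets.UTF_8));
        Files.write(base.resolve("access.txt"), "GET /index".getBytes(StandardCharsets.UTF_8));

        List<Path> found = collectLogFiles(base);
        System.out.println(found.size()); // 2
    }
}
```

Inside `next()` you would then read one of these paths (e.g. with `Files.readAllBytes`), put its text plus `path`/`size` fields into a `SolrInputDocument`, and return it, so only one file's content sits in RAM at a time even though the overall data is huge.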

Upvotes: 1
