Reputation: 462
I have to index log files that sit in a recursive directory structure (each directory can contain one or more files and subdirectories). The log files have all kinds of different extensions. Search will be based on the text of the log files: every file containing a particular string (the search keyword) should be returned along with its name and full path.
I tried to use DIH with Tika for this, but it seems to work for only one file. I also tried FileListEntityProcessor, but couldn't get it working.
How can I index these log files using Solr? Please help if you have done the same.
Thanks in advance.
P.S. Individual log files are not very large, but the overall volume of data is huge.
Upvotes: 1
Views: 2040
Reputation: 462
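TikaEntityProcessor can be nested inside FileListEntityProcessor: the outer entity walks the directory tree and the inner entity extracts the text of each file it finds.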
data-config.xml
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor" transformer="TemplateTransformer"
            baseDir="L:/Documents/65923/"
            fileName=".*\.*" onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" />
      <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Upvotes: 1
Reputation: 1420
I would do something like this:
Stream documents from a directory or set of directories into Solr via an iterator:
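// Imports assume SolrJ 4.x, where HttpSolrServer lives.
import java.util.Iterator;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// The base URL is a placeholder; point it at your own Solr core.
HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
Iterator<SolrInputDocument> iter = new Iterator<SolrInputDocument>() {
    public boolean hasNext() {
        boolean result = false;
        // set result to true or false to say whether you have more documents
        return result;
    }
    public SolrInputDocument next() {
        SolrInputDocument result = null;
        // construct a new document here and set it to result
        return result;
    }
    public void remove() {
        // required before Java 8; removal makes no sense for a one-way stream
        throw new UnsupportedOperationException();
    }
};
server.add(iter);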
See this and other methods here: http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
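The iterator above still needs something to feed it. As a rough sketch of that part, the following walks a directory tree with only the JDK and turns each regular file into a map of fields. The class name LogFileDocs, the field names ("path", "name", "content") and the choice of UTF-8 are assumptions for illustration, not anything Solr requires; in real use each map would be copied into a SolrInputDocument inside next().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

public class LogFileDocs {

    // Lazily yields one field map per regular file anywhere under root.
    // The caller should close the stream (try-with-resources) to release
    // the directory handles held by Files.walk.
    static Stream<Map<String, String>> documents(Path root) throws IOException {
        return Files.walk(root)
                .filter(Files::isRegularFile)
                .map(LogFileDocs::toDocument);
    }

    // Reads a single file into a field map. The field names are hypothetical
    // and must match whatever the Solr schema actually defines.
    static Map<String, String> toDocument(Path file) {
        try {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("path", file.toAbsolutePath().toString());
            doc.put("name", file.getFileName().toString());
            // Assumes the logs are UTF-8 text; malformed bytes are replaced.
            doc.put("content",
                    new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            return doc;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Map<String, String>> docs = documents(root)) {
            Iterator<Map<String, String>> iter = docs.iterator();
            while (iter.hasNext()) {
                System.out.println(iter.next().get("path"));
            }
        }
    }
}
```

Because the stream is lazy, only one file's content is held in memory at a time, which fits the case of many small files with a large total size.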
Upvotes: 1