Reputation:
Is there a good approach, using Solr or a client library feeding into Solr, to index an entire hard drive? This should include the contents of zip files, recursing into zip files nested within zip files.
This should be able to run on Linux (no windows-only clients).
This will of course involve making a single scan over the entire file-system from the root (or any folder actually). I'm not concerned at this point with keeping the index up to date, just creating it initially. This would be similar to the old "Google Desktop" app, which Google discontinued.
Upvotes: 3
Views: 2028
Reputation: 76753
You can manipulate Solr using the SolrJ API.
Here's the API documentation: http://lucene.apache.org/solr/4_0_0/solr-solrj/index.html
And here's an article on how to use SolrJ to index files on your hard drive:
http://blog.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/
Each file is represented by a SolrInputDocument, and you call .addField on it to attach the fields that you'd like to search on at a later time.
Here's example code for an Index Driver:
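import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IndexDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        //TODO: Add some checks here to validate the input path
        int exitCode = ToolRunner.run(new Configuration(),
                new IndexDriver(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), IndexDriver.class);
        conf.setJobName("Index Builder - Adam S @ Cloudera");
        conf.setSpeculativeExecution(false);

        // Set Input and Output paths
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Use TextInputFormat
        conf.setInputFormat(TextInputFormat.class);

        // Mapper has no output (IndexMapper is not defined here; see the linked article)
        conf.setMapperClass(IndexMapper.class);
        conf.setMapOutputKeyClass(NullWritable.class);
        conf.setMapOutputValueClass(NullWritable.class);
        conf.setNumReduceTasks(0);

        JobClient.runJob(conf);
        return 0;
    }
}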
Read the article for more info.
Compressed files
Here's info on handling compressed files: Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats
Solr appears to have a bug in handling zip files. Here's the bug report, which includes a fix: https://issues.apache.org/jira/browse/SOLR-2416
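If you would rather unpack archives on the client side before anything reaches Solr, the recursive part of the scan can be sketched with the JDK alone. The following is only a minimal sketch, not something taken from Solr or the linked article: the class name ZipWalker, the methods listAll and listZip, and the "!/" separator used to label nested entries are all made up for illustration. It only collects names; turning each entry into a Solr document is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
class ZipWalker {

    /** Lists every regular file under root, descending into zip archives. */
    static List<String> listAll(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<String> found = new ArrayList<>();
        for (Path file : files) {
            found.add(file.toString());
            if (isZip(file.toString())) {
                try (InputStream in = Files.newInputStream(file)) {
                    listZip(file.toString(), in, found);
                }
            }
        }
        return found;
    }

    /** Records each entry of a zip stream and recurses when an entry is itself a zip. */
    static void listZip(String prefix, InputStream in, List<String> found) throws IOException {
        // Deliberately not closed here: closing would also close the enclosing stream.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String name = prefix + "!/" + entry.getName();
            found.add(name);
            if (isZip(entry.getName())) {
                // The current entry's bytes are read as a nested zip stream.
                listZip(name, zip, found);
            }
        }
    }

    /** Assumption: archives are recognised by file extension only. */
    static boolean isZip(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".zip");
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String name : listAll(root)) {
            System.out.println(name);
        }
    }
}
```

Running it with a folder as the first argument prints one line per file or archive entry. On a scan from the real file-system root you would probably need extra handling for unreadable directories, which this sketch does not attempt.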
Upvotes: 3