user2080225
user2080225

Reputation:

How to index entire local Hard Drive into Apache Solr?

Is there a good approach with Solr or a client library feeding into Solr to index an entire hard drive. This should include content in the zip files, including recursively of zip files within zip files?

This should be able to run on Linux (no windows-only clients).

This will of course involve making a single scan over the entire file-system from the root (or any folder actually). I'm not concerned at this point with keeping the index up to date, just creating it initially. This would be similar to the old "Google Desktop" app, which Google discontinued.

Upvotes: 3

Views: 2028

Answers (1)

Johan
Johan

Reputation: 76753

You can manipulate Solr using the SolrJ API.

Here's the API documentation: http://lucene.apache.org/solr/4_0_0/solr-solrj/index.html

And here's a article on how to use SolrJ to index files on your harddrive.
http://blog.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/

Files are represented by InputDocument and you use .addField to attach fields that you'd like to search on at a later time.

Here's example code for an Index Driver:

public class IndexDriver extends Configured implements Tool {     

  public static void main(String[] args) throws Exception {
    //TODO: Add some checks here to validate the input path
    int exitCode = ToolRunner.run(new Configuration(),
     new IndexDriver(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), IndexDriver.class);
    conf.setJobName("Index Builder - Adam S @ Cloudera");
    conf.setSpeculativeExecution(false);

    // Set Input and Output paths
    FileInputFormat.setInputPaths(conf, new Path(args[0].toString()));
    FileOutputFormat.setOutputPath(conf, new Path(args[1].toString()));
    // Use TextInputFormat
    conf.setInputFormat(TextInputFormat.class);

    // Mapper has no output
    conf.setMapperClass(IndexMapper.class);
    conf.setMapOutputKeyClass(NullWritable.class);
    conf.setMapOutputValueClass(NullWritable.class);
    conf.setNumReduceTasks(0);
    JobClient.runJob(conf);
    return 0;
  }
}

Read the article for more info.

Compressed files Here's info on handling compressed files: Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats

There seems to be some bug with Solr not handling zip files, here's the bugreport with a fix: https://issues.apache.org/jira/browse/SOLR-2416

Upvotes: 3

Related Questions