gaffcz

Reputation: 3479

Lucene: how to store file content?

I'm trying to index and STORE file content (plain text), but that doesn't seem to be possible this way:

protected Document getDocument(File f) throws Exception {
  Document doc = new Document();
  Field contents = new Field("contents", new FileReader(f));
  Field filename = new Field("filename", f.getName(), Field.Store.YES, Field.Index.ANALYZED);
  doc.add(contents);
  return doc;
}

How can I store the content of a plain text file (without any tags)?

Upvotes: 0

Views: 1433

Answers (2)

mindas

Reputation: 26733

The Field(String, Reader) constructor indexes the text but never stores it. Just read the file contents into a String and use another Field constructor, something like:

protected Document getDocument(File f) throws Exception {
  Document doc = new Document();
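  Scanner scanner = new Scanner(f).useDelimiter("\\A");  // "\\A" matches only at the start of input
  String text;
  try {
    text = scanner.hasNext() ? scanner.next() : "";  // an empty file yields no token
  } finally {
    scanner.close();
  }
  // Store.YES keeps the text retrievable; Index.ANALYZED makes it searchable too
  Field contents = new Field("contents", text, Store.YES, Index.ANALYZED);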
  Field filename = new Field("filename", f.getName(), Store.YES, Index.ANALYZED);
  doc.add(contents);
  doc.add(filename);
  return doc;
}
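
The file-reading step can also be pulled out into a small helper so the scanner is always closed. This is only a sketch using the standard library: the class name FileText, the method name readAll, and the UTF-8 encoding are assumptions made for illustration, not part of the original answer.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class FileText {
    // Hypothetical helper: returns the whole file as one String.
    // Assumes the file is UTF-8 encoded; adjust the charset name if it is not.
    public static String readAll(File f) throws FileNotFoundException {
        // "\\A" matches only at the beginning of input, so the first token is the entire file.
        Scanner scanner = new Scanner(f, "UTF-8").useDelimiter("\\A");
        try {
            // An empty file has no token at all, so fall back to an empty string.
            return scanner.hasNext() ? scanner.next() : "";
        } finally {
            scanner.close();
        }
    }
}
```

The returned String can then be passed to a Field constructor that takes Store.YES, as in the method above.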

Upvotes: 3

jcern

Reputation: 7848

Take a look at Apache Tika (http://tika.apache.org/). It is a good library for extracting text from HTML and other structured documents.

As for storing it in the Lucene index, depending on your needs you can either strip out the tags before storing the text, or create an Analyzer that strips the tags as the text is indexed.
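
To give a rough idea of the first option (stripping tags before storing), here is a deliberately naive sketch in plain Java. It is a regex stand-in, not Tika and not a real HTML parser: it will mishandle comments, scripts, entities, and a ">" inside attribute values. The names TagStripper and stripTags are made up for this example.

```java
public class TagStripper {
    // Hypothetical helper: removes anything that looks like <...> and tidies whitespace.
    // Good enough for simple, well-formed markup only; use a real parser for anything else.
    public static String stripTags(String markup) {
        // Replace each tag with a space so adjacent words do not run together.
        String noTags = markup.replaceAll("<[^>]*>", " ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```

The cleaned String would then be stored in the document field instead of the raw markup.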

Upvotes: 1
