Reputation: 3479
I'm trying to index and STORE file content (plain text), but it doesn't seem to be possible this way:
protected Document getDocument(File f) throws Exception {
    Document doc = new Document();
    // A Reader-based Field is tokenized and indexed, but its content is not stored
    Field contents = new Field("contents", new FileReader(f));
    Field filename = new Field("filename", f.getName(), Field.Store.YES, Field.Index.ANALYZED);
    doc.add(contents);
    doc.add(filename);
    return doc;
}
How can I store the content of a plain text file (without any tags)?
Upvotes: 0
Views: 1433
Reputation: 26733
Just read the file contents into a String and use another Field constructor, something like:
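protected Document getDocument(File f) throws Exception {
    String text;
    // "\\A" makes the Scanner return the whole file as a single token
    try (Scanner scanner = new Scanner(f)) {
        scanner.useDelimiter("\\A");
        text = scanner.hasNext() ? scanner.next() : ""; // an empty file yields no token
    }
    Document doc = new Document();
    // Store.YES keeps the text retrievable; Index.ANALYZED also makes it searchable
    doc.add(new Field("contents", text, Store.YES, Index.ANALYZED));
    doc.add(new Field("filename", f.getName(), Store.YES, Index.ANALYZED));
    return doc;
}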
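If you would rather not use a Scanner for the "read the file contents" step, here is a minimal sketch using only the JDK. The class name FileText and the method readAll are hypothetical, and UTF-8 is an assumption about your files; the resulting String can be passed to the Field constructor in place of the Scanner expression.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper, not part of Lucene.
class FileText {
    // Reads the whole file into one String. UTF-8 is an assumption;
    // use the charset your files are actually written in.
    static String readAll(File f) throws IOException {
        byte[] bytes = Files.readAllBytes(f.toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

This loads the entire file into memory, which is fine for ordinary text files but worth keeping in mind for very large ones.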
Upvotes: 3
Reputation: 7848
Take a look at Apache Tika (http://tika.apache.org/). It is a good library for extracting text from HTML and other structured documents.
As for storing it in the Lucene index: depending on your needs, you can either strip out the tags before storing the text, or use an Analyzer that strips the tags as the text is indexed.
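To illustrate the first option (stripping tags before storing), here is a rough sketch that uses only the JDK. The class TagStripper and the method stripTags are hypothetical names, and the regex is a crude approximation rather than a real HTML parser: it does not handle comments, scripts, entities, or a ">" inside an attribute value, so for real documents a proper extractor such as Tika is the safer choice.

```java
// Hypothetical helper, not part of Lucene or Tika.
class TagStripper {
    // Replaces anything that looks like <...> with a space,
    // then collapses runs of whitespace into single spaces.
    static String stripTags(String markup) {
        String noTags = markup.replaceAll("<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```

The cleaned String would then be stored in the "contents" field like any other plain text.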
Upvotes: 1