TedTrippin

Reputation: 3667

Lucene not indexing large, unanalysed fields

I have found that Lucene won't index an un-analysed field if it's too large (looks like a 16kb limit).

In my app, I am searching for e.g. "*something*". This works fine and finds my doc. If I increase the size of the text past 16kb, the search stops finding it.

Here's how the field is added...

String property = ...
String value = ...
// NOT_ANALYZED: the whole value goes into the index as a single term
Field field = new Field(property, value, Field.Store.NO, Field.Index.NOT_ANALYZED);

Due to a bug in Eclipse I'm unable to debug the Lucene code (currently installing NetBeans!), so I'm wondering if anyone knows where the limit is set and whether it can be increased?

And before anyone suggests not using NOT_ANALYZED or shortening the text, that's in the pipeline!

Upvotes: 1

Views: 160

Answers (1)

femtoRgon

Reputation: 33341

I know you said not to suggest it, but:

Don't use NOT_ANALYZED for searching long, full text fields.

Indexing a long, full-text field as NOT_ANALYZED and then searching with a double-wildcard means you are getting absolutely no benefit from Lucene's full-text search capabilities. This sort of implementation is just a Lucene-powered, extra-fancy linear search. You could just as well store all your data in a plain-text file and search for a match one character at a time.
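To make that concrete, a *something* search over a single NOT_ANALYZED term comes down to something like this (field name and term here are just placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// Leading-and-trailing wildcard over one enormous, un-analysed term:
// Lucene has to walk the term dictionary and pattern-match each term in turn.
Query query = new WildcardQuery(new Term("content", "*something*"));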

Changing this hard maximum term size would be difficult, I believe. It would need to be changed in the DocumentsWriter impl, and comments indicate that the field cache implementation would need to be modified as well. Not worth looking too far into just to keep using an overly-complicated linear search.

You say analysis is in the pipeline, but it's central to actually performing an effective search with Lucene. It's not a cool feature to add later; it's something you must have. Just start with StandardAnalyzer, and refine from there if necessary.
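For what it's worth, a rough sketch of that with the 3.x API your snippet suggests (the content field, the RAMDirectory, and the longText variable below are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

Directory dir = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

// ANALYZED tokenizes the text, so no single term ever comes near the max term length.
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
Document doc = new Document();
doc.add(new Field("content", longText, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

// Search through the same analyzer; a plain term query finds the doc, no wildcards needed.
IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
Query query = new QueryParser(Version.LUCENE_36, "content", analyzer).parse("something");
TopDocs hits = searcher.search(query, 10);

Because the text is tokenized at index time and the query goes through the same analyzer, matching becomes a term lookup in the inverted index rather than a scan over one giant term.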

Upvotes: 1
