ford prefect

Reputation: 7388

When to use custom fields in Lucene

I am working on building a search engine for building domain specific data using Lucene. Lucene is clearly powerful and customizable. Originally I created my own field types, but I was getting 0 hits, so I read this and found that I should use text fields. One of my fields is a date and another is a low-cardinality category. I looked through the setters for Field and couldn't figure out what StringField and TextField imply or how I should think about them. Should I use a custom field type for fields that are not strictly textual?

Upvotes: 0

Views: 1037

Answers (1)

Philipp Ludwig

Reputation: 4190

TextField and StringField

The difference between TextField and StringField is hidden in the class FieldType. FieldType allows you to define fields with custom properties like this:

FieldType type = new FieldType();
type.setTokenized(true);
type.setStoreTermVectors(true);
...
document.add(new Field("fieldName", someString, type));

Both classes extend Field, and each one sets a different predefined field type. It gets more confusing because that field type also differs depending on whether the field is stored. (Source: Lucene 6.5 source code)
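To make the role of FieldType more concrete, here is a minimal sketch of a custom type that behaves much like a non-stored TextField with term vectors switched on. It assumes lucene-core is on the classpath; the class and method names (CustomFieldTypeSketch, textWithTermVectors, bodyField) and the field name "body" are made up for this example. One detail worth knowing: a freshly created FieldType is not indexed at all (its index options default to IndexOptions.NONE), so the setIndexOptions call matters if the field is supposed to be searchable.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

final class CustomFieldTypeSketch {

    // Hypothetical helper: a tokenized, indexed, non-stored type with term vectors.
    static FieldType textWithTermVectors() {
        FieldType type = new FieldType();
        // Without this call the type stays at IndexOptions.NONE and is not searchable.
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        type.setTokenized(true);        // run the value through the Analyzer
        type.setStored(false);          // do not keep the original value
        type.setStoreTermVectors(true); // per-document term vectors
        type.freeze();                  // guard against later modification
        return type;
    }

    // Hypothetical helper: wraps a string in a Field that uses the custom type.
    static Field bodyField(String text) {
        return new Field("body", text, textWithTermVectors());
    }
}
```

Unless you need a combination of properties that the predefined classes don't offer, StringField and TextField are usually the simpler choice.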

When to use what

To make it short:

  • Use StringField for information like IDs, URLs, etc. which won't require any tokenizing, stemming or other processing by an Analyzer.
  • Use TextField for information which requires this processing, such as the title or the content of a document.
  • For anything else which has to be stored in the index, use a StoredField.
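The three rules above could look like this in practice. This is only a sketch, assuming lucene-core is on the classpath; the class name FieldChoiceSketch, the method buildDocument and the field names ("id", "title", "body", "payload") are invented for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

final class FieldChoiceSketch {

    // Hypothetical helper that builds one document with all three kinds of fields.
    static Document buildDocument(String id, String title, String body, String payload) {
        Document doc = new Document();
        // Indexed as a single token, never analyzed: good for exact matches on an ID.
        doc.add(new StringField("id", id, Field.Store.YES));
        // Analyzed (tokenized) text, also stored so it can be shown in results.
        doc.add(new TextField("title", title, Field.Store.YES));
        // Analyzed text that is searchable but not stored.
        doc.add(new TextField("body", body, Field.Store.NO));
        // Stored only: retrievable from a hit, but not searchable.
        doc.add(new StoredField("payload", payload));
        return doc;
    }
}
```

Field.Store.YES / Field.Store.NO only controls whether the original value can be read back from a hit; it does not change how the field is searched.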

Numeric fields

Looking at the documentation, we can see that besides the types Field, StringField and TextField, Lucene mostly offers numeric "Points". Points work like fields in the sense that they are indexed, but they are not stored (see StoredField above for that).

For your date, I would recommend using a LongPoint to store a timestamp, e.g.:

document.add(new LongPoint("date", someCalendar.getTimeInMillis() / 1000));

Using a point will later allow you to perform range queries using LongPoint.newRangeQuery, which can be used to retrieve the documents in a given time frame, or applied as an additional filter to an existing query.
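A rough sketch of both uses, assuming lucene-core is on the classpath. The class DateRangeSketch and its methods are hypothetical names for this example, and the extra StoredField is my addition so that the timestamp can also be read back from a hit. The bounds passed to LongPoint.newRangeQuery are inclusive and have to use the same unit as the indexed value (seconds here).

```java
import java.util.Calendar;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

final class DateRangeSketch {

    // Same conversion as in the indexing line above: milliseconds to seconds.
    static long toEpochSeconds(Calendar calendar) {
        return calendar.getTimeInMillis() / 1000;
    }

    // Index the timestamp as a point and (optionally) store it for retrieval.
    static void addDate(Document doc, Calendar date) {
        long seconds = toEpochSeconds(date);
        doc.add(new LongPoint("date", seconds));
        doc.add(new StoredField("date", seconds));
    }

    // Matches documents whose "date" lies in [from, to], both ends inclusive.
    static Query between(Calendar from, Calendar to) {
        return LongPoint.newRangeQuery("date", toEpochSeconds(from), toEpochSeconds(to));
    }

    // Applies the range as a non-scoring filter on top of an existing query.
    static Query restrictToRange(Query base, Calendar from, Calendar to) {
        return new BooleanQuery.Builder()
                .add(base, BooleanClause.Occur.MUST)
                .add(between(from, to), BooleanClause.Occur.FILTER)
                .build();
    }
}
```

The resulting Query can be passed to IndexSearcher.search like any other query.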

Regarding your "low cardinality category", I'm not sure what you mean, but if it's a number you could use an IntPoint. If it's a short label that needs no analysis, a StringField as described above would fit.
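If the category does turn out to be numeric, a sketch could look like the following. It assumes lucene-core is on the classpath and that each category is represented by a small integer ID, which is my assumption and not something stated in the question; CategorySketch and its method names are hypothetical.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.Query;

final class CategorySketch {

    // Index the category ID as a point and store it so it can be read back.
    static void addCategory(Document doc, int categoryId) {
        doc.add(new IntPoint("category", categoryId));
        doc.add(new StoredField("category", categoryId));
    }

    // Matches documents in exactly one category.
    static Query inCategory(int categoryId) {
        return IntPoint.newExactQuery("category", categoryId);
    }

    // Matches documents in any of the given categories.
    static Query inAnyCategory(int... categoryIds) {
        return IntPoint.newSetQuery("category", categoryIds);
    }
}
```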

Upvotes: 1
