Pete Alvin

Reputation: 4790

Does Lucene.Net support phrases? What is the best approach to tokenizing comma-delimited data (atomically) in fields during indexing?

I have a database with a column I wish to index that has comma-delimited names, e.g.,

User.FullNameList = "Helen Ready, Phil Collins, Brad Paisley"

I want to tokenize each name atomically (each name as a single searchable entity). What is the best approach for this?

  1. Did I miss a simple option to set the tokenize delimiter?
  2. Do I have to subclass an existing class, or write my own class, to roll my own tokenizer?
  3. Something else? ;)

Or does Lucene.net not support phrases?

Or is it smart enough to handle this use case automatically?

I'm sure I'm not the first person to have to do this. Googling produced no noticeable solutions.

*** EDIT: using my example, I want to store these name phrases in a single field:

Helen Ready

Phil Collins

Brad Paisley

NOT these individual words:

Helen

Ready

Phil

Collins

Brad

Paisley

Upvotes: 1

Views: 1018

Answers (2)

bajafresh4life

Reputation: 12853

You can split the string by comma yourself, and then either:

  • Index each name using the Keyword analyzer (non-tokenized)
  • OR index each name using the standard analyzer and wrap your searches in quotes. Make sure to index a dummy term between names so that "Ready Phil" doesn't match the document.
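
The splitting step can be sketched in plain Java (the language of the original Lucene API; the Lucene.Net port would look much the same in C#). This is only a minimal sketch: `NameSplitter` and `splitNames` are hypothetical names, not part of Lucene, and the indexing calls are left out because they differ between versions.

```java
import java.util.ArrayList;
import java.util.List;

public class NameSplitter {
    // Hypothetical helper: splits a comma-delimited field value into whole names.
    public static List<String> splitNames(String fullNameList) {
        List<String> names = new ArrayList<>();
        for (String part : fullNameList.split(",")) {
            String name = part.trim();   // drop the space that follows each comma
            if (!name.isEmpty()) {       // skip empty entries such as "a,,b"
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Each printed name would be indexed as its own value of the same field.
        for (String name : splitNames("Helen Ready, Phil Collins, Brad Paisley")) {
            System.out.println(name);
        }
    }
}
```

With the first option, each name returned here would be added to the document as a separate, non-tokenized value of the same field, so "Helen Ready" matches only as a whole.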

Upvotes: 0

Yuval F

Reputation: 20621

Edit: Having read your clarification, here is hopefully a more relevant answer:

  1. You did not miss an option to modify the separator character.
  2. You do need to roll your own tokenizer. I suggest you subclass CharTokenizer. You need to define isTokenChar() according to your spec, meaning that anything but a comma is a token char.
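
To illustrate the idea without depending on a particular Lucene version, here is a standalone Java sketch of the rule such a tokenizer would apply. `CommaTokenizerSketch` and its `tokenize` loop are hypothetical stand-ins for what CharTokenizer does internally; only the `isTokenChar` predicate corresponds to the method you would actually override, and its parameter type varies between Lucene versions.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaTokenizerSketch {
    // The predicate described above: every character except a comma is a token character.
    public static boolean isTokenChar(char c) {
        return c != ',';
    }

    // Stand-in for the tokenizer loop: collects maximal runs of token characters.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else {
                flush(current, tokens);   // a comma ends the current token
            }
        }
        flush(current, tokens);           // emit the last token, if any
        return tokens;
    }

    // Trimming is an assumption added by this sketch, not something the predicate provides.
    private static void flush(StringBuilder current, List<String> tokens) {
        String token = current.toString().trim();
        if (!token.isEmpty()) {
            tokens.add(token);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Helen Ready, Phil Collins, Brad Paisley"));
    }
}
```

Note that a predicate of "anything but a comma" also keeps the space after each comma, so a real subclass would likely need to trim tokens or otherwise handle that whitespace.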

Upvotes: 1
