user1840253

Reputation: 91

Solr 3.6.1 splitting words at a dash

We have a trouble ticket format made up of digits separated by a dash, i.e., n-nnnnnnn.

The documentation at http://lucidworks.lucidimagination.com/display/solr/Tokenizers (in the sections on Standard Tokenizer and Classic Tokenizer) implies that the following holds both before and after support for Unicode Standard Annex UAX#29 was added:

Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

Our Solr installation uses only StandardTokenizerFactory, yet tickets in this format are split at the dash in queries. I'm new to Solr/Lucene. I've downloaded the source code for 3.6.1, and its comments imply the opposite of the documentation (unless a dashed number is still considered a number). I wasn't able to follow the Lex processing.

Upvotes: 4

Views: 1294

Answers (1)

Rishi Dua

Reputation: 2334

You need the Regular Expression Pattern Tokenizer (solr.PatternTokenizerFactory). This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or as a pattern matching the text that should be extracted as tokens. Either way, you can keep n-nnnnnnn together as a single token instead of having it split at the dash.

See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
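
To make the two interpretations concrete, here is a minimal sketch in plain Java using only java.util.regex. It does not call Solr or Lucene. The class name, both patterns, and the sample query are my own assumptions for illustration, not something taken from your schema. As far as I know, in Solr the choice between the two modes is made with the group attribute of the tokenizer (-1 to treat the pattern as a delimiter, 0 or higher to emit the matched group as the token), but please check that against the documentation for your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics the two ways a pattern can drive tokenization.
class TicketTokenizerSketch {

    // Assumed "match" pattern: a ticket id (n-nnnnnnn) or any run of word characters.
    static final Pattern TOKEN = Pattern.compile("\\d-\\d{7}|\\w+");

    // Assumed "delimiter" pattern: tokens are whatever lies between runs of whitespace.
    static final Pattern DELIMITER = Pattern.compile("\\s+");

    // Match mode: every match of the pattern becomes a token.
    static List<String> tokenizeByMatch(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Delimiter mode: the pattern separates tokens; empty pieces are dropped.
    static List<String> tokenizeByDelimiter(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : DELIMITER.split(text)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String query = "ticket 1-2345678 reopened";
        System.out.println(tokenizeByMatch(query));     // [ticket, 1-2345678, reopened]
        System.out.println(tokenizeByDelimiter(query)); // [ticket, 1-2345678, reopened]
    }
}
```

In both modes the ticket number survives as one token because the dash is never treated as a boundary. Whichever pattern you settle on, it should be applied in both the index and query analyzers, otherwise the indexed tokens and the query tokens will not line up.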

Upvotes: 1
