Solr 3.6.1 splitting word boundaries at a dash

Question

Our trouble ticket numbers consist of digits separated by a dash, i.e., n-nnnnnnn.

The Tokenizers page at http://lucidworks.lucidimagination.com/display/solr/Tokenizers (in the sections on Standard Tokenizer and Classic Tokenizer) implies that, both before and after support for Unicode Standard Annex UAX#29 was added:

Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
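To make the discrepancy concrete, here is a minimal sketch of the rule as quoted above next to the behaviour we actually see. It is plain Java string handling, not Lucene's implementation; the class name, method names, and the sample ticket number `1-2345678` are hypothetical and only illustrate the n-nnnnnnn format.

```java
import java.util.ArrayList;
import java.util.List;

public class HyphenRuleSketch {

    // Rule as quoted from the documentation: split a word at hyphens
    // unless the word contains a digit, in which case keep it whole.
    public static List<String> splitPerQuotedRule(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.matches(".*\\d.*")) {
                tokens.add(word); // number present: hyphens preserved
            } else {
                for (String part : word.split("-")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }

    // Behaviour observed in our queries: every hyphen acts as a boundary,
    // whether or not the word contains digits.
    public static List<String> splitAtEveryHyphen(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.trim().split("[\\s-]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String query = "ticket 1-2345678 follow-up";
        System.out.println(splitPerQuotedRule(query));  // [ticket, 1-2345678, follow, up]
        System.out.println(splitAtEveryHyphen(query));  // [ticket, 1, 2345678, follow, up]
    }
}
```

The first method is what I expected from the documentation; the second matches what our installation does with a ticket number.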

Our Solr installation uses only StandardTokenizerFactory, yet this trouble ticket format is being split at the dash in queries. I'm new to Solr/Lucene. I've downloaded the source for 3.6.1, and the comments imply the opposite of that page (unless a dashed number is still considered a number). I wasn't able to follow the lexer processing:

Tokens produced are of the following types:
- `<ALPHANUM>`: A sequence of alphabetic and numeric characters
- `<NUM>`: A number
- `<SOUTHEAST_ASIAN>`: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- `<IDEOGRAPHIC>`: A single CJKV ideographic character
- `<HIRAGANA>`: A single hiragana character

Answers (1)
