Reputation: 91
We have a trouble ticket format of numerics divided by a dash i.e., n-nnnnnnn
The link http://lucidworks.lucidimagination.com/display/solr/Tokenizers (in the sections on Standard Tokenizer and Classic Tokenizer) implies that both before and after the support of Unicode standard annex UAX#29 :
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
Our Solr installation is only using StandardTokenizerFactory yet this trouble ticket format is being split in queries at the dash. I'm new to solr/lucene. I've downloaded the code for 3.6.1 and the comments imply the opposite (unless a dashed number is still considered a number). I wasn't able to follow the Lex processing:
Asian languages, including Thai, Lao, Myanmar, and Khmer</li>
Upvotes: 4
Views: 1294
Reputation: 2334
You need the Regular Expression Pattern Tokenizer. This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
Upvotes: 1