Reputation: 125
I am having trouble indexing item names that contain numbers and symbols. A sample of my data is shown below:
ANGLE BARS ORANGE - 4.0MM 2 - 1/2"
B.I SQUARE TUBING 2" X 3"
B.I. PIPE S-40 10MM 3/8"
B.I SQUARE TUBING 1" X 2"
PLYWOOD MARINE 3/4X4X8
PLYWOOD STA. CLARA 1/8X4X8
PLYWOOD STA. CLARA 3/16X4X8
I want to tokenize my data on whitespace only (ignoring trailing spaces), without dropping the symbols, because the symbols are essential. That way, a search for "plywood sta. clara", "b.i square 2" X 3"", or "angle orange 2 - 1/2" should return a result. I tried WhitespaceAnalyzer, but the symbols were dropped. I also tried StandardAnalyzer, but it drops stop words as well as the symbols. Which analyzer should I use instead?
Upvotes: 1
Views: 572
Reputation: 26012
You can use PatternAnalyzer with a regular expression of your own, or create a custom Analyzer.
Upvotes: 3
Reputation: 2216
Try using org.apache.lucene.analysis.miscellaneous.PatternAnalyzer. You can supply a regular expression that defines the token delimiters.
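To illustrate what a delimiter pattern would do to the sample data, here is a minimal sketch in plain Java. It does not call Lucene at all. It only shows the tokens you would get by treating runs of whitespace (`\s+`) as the sole delimiter and lowercasing each token. The class name `DelimiterTokenizerSketch` and the method `tokenize` are made up for this example, and the lowercasing step is an assumption based on the lowercase queries in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: shows the effect of a
// whitespace-only delimiter pattern on item names with symbols.
class DelimiterTokenizerSketch {

    // Delimiter: one or more whitespace characters. Every other
    // character (dots, slashes, quotes, hyphens) stays in the token.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // trim() removes leading and trailing spaces before splitting
        for (String piece : WHITESPACE.split(text.trim())) {
            if (!piece.isEmpty()) {
                // Lowercase so that "PLYWOOD" and "plywood" compare equal
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [plywood, sta., clara, 3/16x4x8]
        System.out.println(tokenize("PLYWOOD STA. CLARA 3/16X4X8"));
        // prints [b.i, square, tubing, 2", x, 3"]
        System.out.println(tokenize("B.I SQUARE TUBING 2\" X 3\""));
    }
}
```

If the same pattern is used at index time and at query time, the query "plywood sta. clara" yields the tokens plywood, sta., and clara, which all appear among the tokens of the indexed item name.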
Upvotes: 0