Reputation: 215
I am building a search engine for a site using Zend_Search_Lucene.
I am currently using Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive, which works fine except for one thing: it distinguishes between accented and unaccented characters.
In Google (and other search engines), when you search for "χιονι" it returns results for all variations of the word, like "χιόνι", which is the correct accented spelling in Greek (χιόνι = snow, by the way). In Lucene (in general, not only Zend_Search_Lucene) this is not the default or even bundled behavior, from what I've seen.
My first attempt at a solution was to mimic what Lucene's case-insensitive analyzers do: just as they lowercase everything during both indexing and searching, I would strip the accents from letters at both stages (i.e. $str = strtr($str, 'ό', 'ο')).
This failed because PHP has no mb_strtr, the three-argument form of strtr translates byte by byte and so mangles multibyte characters like these, and my attempts with preg_replace did not work either.
Is there a way to make Lucene search in an "accent-insensitive" mode (probably with an analyzer?), or alternatively a way to unaccent multibyte characters in PHP? (I searched for this as well, with no results.)
Note that the characters I want to search for are not Western European accented characters, for which there are some unaccent solutions for PHP on the web.
Upvotes: 3
Views: 1528
Reputation: 13296
Have you tried normalizer_normalize to remove diacritics from the text? See "How to remove diacritics from text?".
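As a rough sketch of that idea, assuming the intl extension is installed: decompose the string to NFD so that each accent becomes a separate combining mark, then delete the marks with a Unicode-aware regex. The function name remove_diacritics is hypothetical, not something PHP or Zend provides.

```php
<?php
/**
 * Hypothetical helper (not part of PHP or Zend Framework).
 * Assumes the intl extension (normalizer_normalize) and PCRE with UTF-8 support.
 *
 * @param string $text UTF-8 encoded input
 * @return string the input with combining diacritical marks removed
 */
function remove_diacritics($text)
{
    // NFD splits a precomposed letter such as "ό" into "ο" + U+0301.
    $decomposed = normalizer_normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // Normalization failed (e.g. invalid UTF-8): return the input unchanged.
        return $text;
    }

    // \p{Mn} matches nonspacing marks; the /u modifier makes PCRE read UTF-8.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    // Recompose anything that was decomposed but is not an accent.
    return normalizer_normalize($stripped, Normalizer::FORM_C);
}

echo remove_diacritics('χιόνι'), "\n";
```

Because this works on Unicode properties rather than a hand-written character table, it is not limited to Western European letters. The same function would have to be applied both when indexing and when querying.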
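You can also use $str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str); but note that the output of this call is plain ASCII, so Greek letters will not survive as Greek; it is mainly useful for Latin-based text.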
You can then create a token filter (by extending Zend_Search_Lucene_Analysis_TokenFilter) to normalize your keywords.
I don't know if it works for your encoding.
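A minimal sketch of such a token filter, assuming Zend Framework 1 is already loaded or autoloadable and the intl extension is available. The class name My_Analysis_UnaccentFilter is hypothetical; the Zend classes and methods are the ones Zend_Search_Lucene's own lowercase filters use.

```php
<?php
// Assumption: Zend Framework 1 is on the include path and autoloading is set up.
// My_Analysis_UnaccentFilter is a hypothetical name, not a class shipped with Zend.
class My_Analysis_UnaccentFilter extends Zend_Search_Lucene_Analysis_TokenFilter
{
    /**
     * Return a copy of the token with combining diacritics stripped.
     */
    public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
    {
        $text = $srcToken->getTermText();

        // Decompose accented letters, then drop the nonspacing marks.
        $decomposed = normalizer_normalize($text, Normalizer::FORM_D);
        if ($decomposed !== false) {
            $text = preg_replace('/\p{Mn}+/u', '', $decomposed);
        }

        // Keep the original offsets and position so highlighting and
        // phrase queries still line up with the source text.
        $newToken = new Zend_Search_Lucene_Analysis_Token(
            $text,
            $srcToken->getStartOffset(),
            $srcToken->getEndOffset()
        );
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());

        return $newToken;
    }
}

// Chain the filter onto the analyzer already in use and make it the default.
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive();
$analyzer->addFilter(new My_Analysis_UnaccentFilter());
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
```

The default analyzer has to be set this way before both indexing and searching, and an index built with the old analyzer would need to be rebuilt so that stored terms and query terms are normalized identically.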
Upvotes: 2