Dmitry

Reputation: 23

Tamil language full-word search with .NET Regex

I have a Grid filled with Tamil words and a search string. I need to implement a full-word search through the Grid records. I'm using the .NET Regex class for this. It sounds pretty simple; what I used to do is:

string pattern = @"\b" + searchText + @"\b";

It works as expected for Latin-script languages, but for Tamil this expression returns strange results. I have read about Unicode characters in regular expressions, but that didn't help much. What I probably need is to determine where the word boundary is found and why.

As an example: for the pattern "\bஅம்மா\b", Regex finds matches in the அம்மாவிடம் and அம்மாக்கள் records but not in the original அம்மா record.

Upvotes: 2

Views: 1045

Answers (1)

Wiktor Stribiżew

Reputation: 627327

The last char in the word "அம்மா" is U+0BBE TAMIL VOWEL SIGN AA, which is a combining mark (in regex, it can be matched with \p{M}).

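\b only matches between a word char and a non-word char, or between a word char and the start/end of the string. In .NET, \w covers letters, decimal digits, connector punctuation and nonspacing marks (\p{Mn}), but U+0BBE is a spacing combining mark (\p{Mc}), so it counts as a non-word char. As a result, \b does not match between ா and the end of the string (or any other non-word char), but it does match between ா and a following letter, as in அம்மாவிடம்.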
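To see where a boundary can or cannot fall, it may help to list the Unicode category of each char next to whether \w matches it. The following is only a minimal sketch; the class and method names (BoundaryDiagnostics, Describe) are made up for illustration and are not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class BoundaryDiagnostics
{
    // Hypothetical helper: for each UTF-16 code unit in the input, report its
    // code point, its Unicode general category, and whether .NET's \w matches it.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        foreach (char c in text)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            bool isWordChar = Regex.IsMatch(c.ToString(), @"^\w$");
            lines.Add($"U+{(int)c:X4} {category} word={isWordChar}");
        }
        return lines;
    }
}
```

For "அம்மா", the last entry should show U+0BBE as SpacingCombiningMark with word=False, while the virama U+0BCD in the middle is a NonSpacingMark with word=True. That difference is why \b behaves differently at the end of this word than inside it.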

The usual workaround in this case is to replace \b with negative lookarounds:

var pattern = $@"(?<!\w){searchText}(?!\w)";

See this regex demo.

Here, (?<!\w) fails the match if there is a word char immediately before searchText, and (?!\w) fails the match if there is a word char immediately after it. Note that you should also use Regex.Escape(searchText) if the text can contain special regex chars.
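As a rough sketch of how the two approaches compare, the helpers below wrap the original \b pattern and the lookaround pattern, with Regex.Escape applied to the search text in both. The class and method names are hypothetical and chosen only for this example.

```csharp
using System.Text.RegularExpressions;

public static class WholeWordSearch
{
    // The approach from the question: \b on both sides of the search text.
    public static bool MatchesWithWordBoundary(string input, string searchText) =>
        Regex.IsMatch(input, @"\b" + Regex.Escape(searchText) + @"\b");

    // The workaround: require that no \w char sits directly before or after
    // the (escaped) search text.
    public static bool MatchesWithLookarounds(string input, string searchText) =>
        Regex.IsMatch(input, $@"(?<!\w){Regex.Escape(searchText)}(?!\w)");
}
```

With "அம்மா" as the search text, MatchesWithWordBoundary should reproduce the behaviour described in the question (a match in அம்மாவிடம் but not in அம்மா itself), while MatchesWithLookarounds should match அம்மா and reject அம்மாவிடம் and அம்மாக்கள்.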

Or, if you want to avoid matching when the search text is directly adjacent to any letter or diacritic, use

var pattern = $@"(?<![\p{{L}}\p{{M}}]){searchText}(?![\p{{L}}\p{{M}}])";

See this regex demo.

The (?<![\p{L}\p{M}]) and (?![\p{L}\p{M}]) lookarounds work similarly to the ones above, except that they fail the match if there is a letter or a combining mark on either side of the search phrase.
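The difference from the \w version shows up when the search text ends just before a spacing combining mark. Below is a minimal sketch of the stricter variant; StrictWholeWordSearch and IsWholeWord are hypothetical names used only for illustration, and Regex.Escape is added on the assumption that the search text is meant literally.

```csharp
using System.Text.RegularExpressions;

public static class StrictWholeWordSearch
{
    // Rejects a match when a letter (\p{L}) or any combining mark (\p{M},
    // which includes spacing marks such as U+0BBE) touches the search text.
    public static bool IsWholeWord(string input, string searchText)
    {
        string pattern =
            $@"(?<![\p{{L}}\p{{M}}]){Regex.Escape(searchText)}(?![\p{{L}}\p{{M}}])";
        return Regex.IsMatch(input, pattern);
    }
}
```

For instance, searching for அம்ம (without the final vowel sign) in அம்மா should be rejected here, because the following char is a combining mark, whereas the (?!\w) lookahead would let it through. Note that this variant does not treat digits or underscores as part of a word, so whether it fits depends on the data in the Grid.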

Upvotes: 1
