How do I use letters made of multiple Unicode code points in a regex?

Question

Context: The following expressions are written for Tamil-language text.

'^[சிகு]' is the intended expression for lines that start with either 'சி' or 'கு', just as '^[ab]' in English matches lines that start with either 'a' or 'b'.

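But Unicode represents many letters of some Eastern languages with multiple code points, so the class is actually read as '^[ச,ி,க,ு]' (commas added for clarity), because சி -> ச,ி and கு -> க,ு. As a result it matches any line that starts with ச or க, whatever vowel sign follows.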
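To make the decomposition concrete, here is a small sketch that lists the code points behind each syllable using only the standard library. The helper name `describe` is just for illustration.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Each syllable looks like one letter but is stored as two code points:
# a consonant followed by a vowel sign.
for syllable in ("சி", "கு"):
    print(syllable, len(syllable), describe(syllable))

# A character class is a set of single code points, so '[சிகு]'
# has four independent members rather than two syllables.
class_members = sorted(set("சிகு"))
print(len(class_members), class_members)
```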

Running the expression over a few words in Python gives the following results (see link 1 below for the full results).

Note: The expected results can be obtained with the expression '^(சி|கு)', but that only covers this specific case. What if I want to write an expression that matches சிசிசிகுகுசிகு? Is there any way to make the expression '^[சிகு]+' match சிசிசிகுகுசிகு?
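One possible way to handle the repeated case, assuming the syllables of interest can be listed up front, is to build the alternation programmatically and put the quantifier on a non-capturing group. This is only a sketch with the standard `re` module; the function name `syllable_run_pattern` is made up for this example.

```python
import re


def syllable_run_pattern(syllables):
    """Compile a pattern matching one or more of the given syllables at the start."""
    # Longest alternatives first, so a syllable is tried before any shorter
    # alternative that happens to be its prefix.
    ordered = sorted(syllables, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # (?:...)+ repeats whole syllables, which [...]+ cannot do because a
    # character class only ever matches one code point at a time.
    return re.compile(f"^(?:{alternatives})+")


pattern = syllable_run_pattern(["சி", "கு"])

text = "சிசிசிகுகுசிகு"
match = pattern.match(text)
print(match.group(0) == text)

# A word that starts with the bare consonant and a different vowel is rejected.
print(pattern.match("சுழி"))
```

The group behaves like a character class whose members are syllables, so `+`, `*` and `{m,n}` can be applied to it in the same way.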

For ease of use, I am adding the textual samples here.

Expected:

குல்
குழை
குறை
சிலை
குறி
குரு
சிறை
குடி
குடை
குமை
சிதை
குலை
குளி
குவி

Matched:

கடி
கழி
கலி
கலை
கா
கோடு
குல்
சேர்
சரி
கை
கரை
சாய்
கடு
குழை
குறை
கோ
சுழி
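The gap between the two lists above can be reproduced with a few of the sample words. The snippet below is a minimal sketch using the standard `re` module. The word list is a small subset picked from the samples, not the full data set.

```python
import re

# A handful of words taken from the Expected and Matched lists above.
words = ["கடி", "கழி", "குல்", "சேர்", "சரி", "சிலை", "சுழி"]

char_class = re.compile("^[சிகு]")      # any ONE of the code points ச, ி, க, ு
alternation = re.compile("^(?:சி|கு)")  # the two-code-point sequence சி or கு

matched_by_class = [w for w in words if char_class.match(w)]
matched_by_alternation = [w for w in words if alternation.match(w)]

# The class accepts every word that begins with ச or க, regardless of the
# vowel sign that follows; the alternation keeps only the intended words.
print(matched_by_class)
print(matched_by_alternation)
```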

1 https://gist.github.com/vanangamudi/591e311d709f5d5d6672a34d09b510cc
