Reputation: 727
Context: The following expressions are written for Tamil-language text.
'^[சிகு]'
is the intended expression for lines that start with either 'சி' or 'கு',
just like how in English '^[ab]'
matches lines that start with either 'a' or 'b'
But Unicode represents some Eastern scripts with multiple code points per written character, so the expression is effectively '^[ச,ி,க,ு]'
(commas added for clarity), because சி -> ச,ி
and கு -> க,ு
Running the expression over a few words in Python gives the following results (the full results are at link 1 below).
Note: the expected results can be obtained with the expression '^(சி|கு)'
but that only works for this specific case. What if I want to write an expression that matches சிசிசிகுகுசிகு
? Is there any way to make the expression '^[சிகு]+'
match சிசிசிகுகுசிகு
?
For ease of use, I am adding the text samples here.
Expected:
குல்
குழை
குறை
சிலை
குறி
குரு
சிறை
குடி
குடை
குமை
சிதை
குலை
குளி
குவி
Matched:
கடி
கழி
கலி
கலை
கா
கோடு
குல்
சேர்
சரி
கை
கரை
சாய்
கடு
குழை
குறை
கோ
சுழி
1 https://gist.github.com/vanangamudi/591e311d709f5d5d6672a34d09b510cc
Upvotes: 5
Views: 137
Reputation: 627119
A character class in Python matches exactly one code point, i.e. something that can be written with a single \uXXXX
or \UXXXXXXXX
escape. Character classes do not match character sequences. Grouping constructs are meant to do that.
Each of your characters consists of several code points (a consonant followed by a vowel sign), and they cannot be rewritten as single code points. Hence you will always get OR behavior between the code points inside a character class, as you described: [சிகு]
(seen by the regex engine as [ச,ி,க,ு])
will match any one of the four code points defined in the class, not either of the two character sequences.
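To see this concretely, here is a small sketch, assuming Python 3 (where a str is a sequence of code points). The variable names are only illustrative. It prints how each syllable breaks down and shows that the character class accepts words beginning with a bare ச or க:

```python
import re
import unicodedata

# Each syllable is two code points: a consonant plus a vowel sign.
for syllable in ("சி", "கு"):
    print(len(syllable), [unicodedata.name(ch) for ch in syllable])

# The class holds four separate code points, so any word that starts
# with ச or க matches, whatever vowel sign follows.
char_class = re.compile(r"^[சிகு]")
for word in ("கடி", "சுழி", "குடி"):
    print(word, bool(char_class.match(word)))  # True for all three
```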
To match character sequences, like the code points that make up each of these characters, you will have to use a grouping construct:
சி|கு
(?:சி|கு)
(சி|கு)
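For the repeated case in the question, the quantifier can be applied to the group instead of a character class. A minimal sketch, assuming the standard re module and a hypothetical sample word list taken from the question:

```python
import re

# Non-capturing group: one or more whole syllables at the start.
syllables = re.compile(r"^(?:சி|கு)+")

target = "சிசிசிகுகுசிகு"
match = syllables.match(target)
print(match is not None and match.group(0) == target)  # True

# Only words that really begin with சி or கு are kept.
words = ["குடி", "சிலை", "கடி", "சுழி", "கா"]
matched = [w for w in words if syllables.match(w)]
print(matched)  # ['குடி', 'சிலை']
```

The non-capturing form (?:...) is used here only because the captured text is not needed. A capturing group would match the same strings.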
Upvotes: 2