Reputation: 2491
I want to split a multilingual string into single-language tokens using a regex.
For example, given this English-Arabic string:
'his name was محمد, and his mother name was آمنه.'
The result must be as below:
Upvotes: 5
Views: 720
Reputation: 2417
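// Matches each run of English words, including surrounding whitespace, parentheses, and colons. Cast/ToList need "using System.Linq;".
System.Text.RegularExpressions.Regex regx = new System.Text.RegularExpressions.Regex(@"([\s\(\:]*[a-zA-Z]+[\s\)\:]*)+");
var matches = regx.Matches(input).Cast<System.Text.RegularExpressions.Match>().ToList();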
Upvotes: 0
Reputation: 336238
It's not perfect (you definitely need to try it on some real-world examples to see if it fits), but it's a start:
splitArray = Regex.Split(subjectString,
    @"(?<=\p{IsArabic})      # (if the previous character is Arabic)
    [\p{Zs}\p{P}]+           # split on whitespace/punctuation
    (?=\p{IsBasicLatin})     # (if the following character is Latin)
    |                        # or
    (?<=\p{IsBasicLatin})    # vice versa
    [\s\p{P}]+
    (?=\p{IsArabic})",
    RegexOptions.IgnorePatternWhitespace);
This splits on whitespace/punctuation if the preceding character is from the Arabic block and the following character is from the Basic Latin block (or vice versa).
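To try this on the sample string from the question, here is a minimal self-contained sketch. The names ScriptSplitter and SplitByScript are made up for illustration (they are not part of .NET); the pattern is the same one as above, just condensed onto two lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptSplitter
{
    // Same pattern as in the answer: whitespace/punctuation that sits
    // between an Arabic character and a Basic Latin one (in either order).
    private static readonly Regex Boundary = new Regex(
        @"(?<=\p{IsArabic})      [\p{Zs}\p{P}]+ (?=\p{IsBasicLatin})
        | (?<=\p{IsBasicLatin})  [\s\p{P}]+     (?=\p{IsArabic})",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns the pieces between script boundaries.
    public static string[] SplitByScript(string input)
    {
        return Boundary.Split(input);
    }
}

public static class Program
{
    public static void Main()
    {
        // Needed on some consoles so the Arabic text prints correctly.
        Console.OutputEncoding = System.Text.Encoding.UTF8;

        string input = "his name was محمد, and his mother name was آمنه.";
        foreach (string token in ScriptSplitter.SplitByScript(input))
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

For that input the split should give four tokens: "his name was", "محمد", "and his mother name was" and "آمنه." The final period stays attached to the last Arabic token because no Latin character follows it, so you may want to trim trailing punctuation separately. Also note that the Basic Latin block covers all of ASCII, including digits, spaces and punctuation, not just letters.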
Upvotes: 6