ARZ

Reputation: 2491

Split multi-lingual string using Regex to uni-lingual tokens

I want to split a multilingual string into unilingual tokens using Regex.

For example, given this English-Arabic string:

'his name was محمد, and his mother name was آمنه.'

The result must be as below:

  1. 'his name was '
  2. 'محمد,'
  3. ' and his mother name was '
  4. 'آمنه.'

Upvotes: 5

Views: 720

Answers (2)

Ali Ahmadi

Reputation: 2417

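// Requires "using System.Linq;" for Cast<T>() and ToList().
// Matches each run of Latin-letter words plus adjacent whitespace, parentheses, and colons.
System.Text.RegularExpressions.Regex regx = new System.Text.RegularExpressions.Regex(@"([\s\(\:]*[a-zA-Z]+[\s\)\:]*)+");
var matches = regx.Matches(input).Cast<System.Text.RegularExpressions.Match>().ToList();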

Upvotes: 0

Tim Pietzcker

Reputation: 336238

It's not perfect (you definitely need to try it on some real-world examples to see if it fits), but it's a start:

string[] splitArray = Regex.Split(subjectString, 
    @"(?<=\p{IsArabic})    # (if the previous character is Arabic)
    [\p{Zs}\p{P}]+         # split on whitespace/punctuation
    (?=\p{IsBasicLatin})   # (if the following character is Latin)
    |                      # or
    (?<=\p{IsBasicLatin})  # vice versa
    [\s\p{P}]+
    (?=\p{IsArabic})", 
    RegexOptions.IgnorePatternWhitespace);

This splits on whitespace/punctuation if the preceding character is from the Arabic block and the following character is from the Basic Latin block (or vice versa).
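
To try this on the sample string from the question, a minimal console sketch could look like the following. The class name `SplitDemo` and the bracketed printing are only illustrative assumptions; the pattern is the one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

class SplitDemo
{
    static void Main()
    {
        // Sample input taken from the question.
        string subjectString = "his name was محمد, and his mother name was آمنه.";

        string[] splitArray = Regex.Split(subjectString,
            @"(?<=\p{IsArabic})    # previous character is Arabic
            [\p{Zs}\p{P}]+         # whitespace/punctuation to split on
            (?=\p{IsBasicLatin})   # next character is Basic Latin
            |                      # or
            (?<=\p{IsBasicLatin})  # previous character is Basic Latin
            [\s\p{P}]+
            (?=\p{IsArabic})       # next character is Arabic",
            RegexOptions.IgnorePatternWhitespace);

        // Print each token in brackets so leading/trailing whitespace is visible.
        foreach (string token in splitArray)
        {
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

Note that `Regex.Split` discards the matched separators, so the whitespace and the comma between the two scripts should not appear in the tokens. This differs from the output requested in the question, where 'محمد,' keeps its comma and the English tokens keep their surrounding spaces. The trailing '.' stays attached to the last token because nothing follows it.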

Upvotes: 6
