SilverCrow

Reputation: 167

Split a string that contains English and Hebrew in C#

I have this string:

string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל [email protected]";

and I'm trying to split it the following way:

string[0] = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל "
string[1] = "[email protected]"

I'm using this split method:

string[] split =  Regex.Split(str, @"^[א-ת]+$");

I want to split between Hebrew and English words, but if the current word is in the same language as the previous one, it should be appended to the previous part.

But I can't make it work. What am I doing wrong?

Thanks

Upvotes: 1

Views: 1008

Answers (6)

Kobi

Reputation: 138037

Here's one approach:

[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+

Use this pattern with Regex.Matches:

var matches = Regex.Matches(input, @"[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+");

The pattern has two parts. It either matches:

  • [\p{IsHebrew}\P{L}]+ - a block containing Hebrew characters and non-letters,

OR

  • \P{IsHebrew}+ - a block of non-Hebrew characters (including non-Hebrew letters and other non-letter characters).

We're using Unicode Named Blocks like \p{IsHebrew} and \p{IsBasicLatin}.

A similar option is [\p{IsHebrew}\P{L}]+|[\p{IsBasicLatin}\P{L}]+, which specifically matches a block with Latin (English) letters.

Working example: regex storm, C# example
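
A minimal sketch of how the pattern above could be used to collect the blocks into an array. The input string is a hypothetical stand-in (two Hebrew words written as Unicode escapes, followed by the placeholder address user@example.com), and the code assumes C# 9+ top-level statements:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input: two Hebrew words ("shalom olam"), a space, then a placeholder address.
string input = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD user@example.com";

// Each match is either a Hebrew/non-letter block or a non-Hebrew block.
string[] parts = Regex.Matches(input, @"[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(parts.Length); // 2
Console.WriteLine(parts[1]);     // user@example.com
```

Here the separating space stays with the Hebrew block, because a space is a non-letter and the first alternative is tried first.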

Upvotes: 2

styx

Reputation: 1915

Why not simply use \p{IsHebrew}?

Something like this:

 string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל [email protected]";
 string pattern = @"[\p{IsHebrew}]+";
 var hebrewMatchCollection = Regex.Matches(str, pattern);
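 // Join all Hebrew runs with spaces (punctuation such as the comma is dropped); needs using System.Linq.
 string hebrewPart = string.Join(" ", hebrewMatchCollection.Cast<Match>().Select(m => m.Value));
 // The last piece after the final Hebrew run is the non-Hebrew tail (it keeps its leading space).
 var englishPart = Regex.Split(str, pattern).Last();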

Upvotes: 0

Panagiotis Kanavos

Reputation: 131492

The pattern passed to Regex.Split matches the delimiter, and the delimiter isn't included in the results. It looks like you want to split between the last Hebrew and the first non-Hebrew character, e.g.:

Regex.Split(str,@"\p{IsHebrew} \P{IsHebrew}")

\p{} matches a character that belongs to a specific Unicode general category or named block, while \P{} matches any character that does not belong to it.

Unfortunately, this pattern consumes the last Hebrew and the first non-Hebrew character as part of the delimiter, so both are missing from the results:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות   
[email protected] 

Capture groups can be used to include characters matched by the delimiter pattern in the results. Simply adding groups, as in (\p{IsHebrew}) (\P{IsHebrew}), returns each captured character as a separate result, though:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות  
ל 
m 
[email protected] 

Vladi Pavelka's use of lookbehind and lookahead assertions fixes this. (?<=\p{IsHebrew}) (?=\P{IsHebrew}) only checks the neighbouring characters without consuming them, so it returns the expected results:

Regex.Split(str,@"(?<=\p{IsHebrew}) (?=\P{IsHebrew})")

will return:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל 
[email protected] 
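
The three variants above can be compared side by side. This is a small sketch on a hypothetical input (one Hebrew word written as Unicode escapes plus the placeholder address user@example.com), assuming C# 9+ top-level statements; only the array lengths are printed to avoid console encoding issues:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: one Hebrew word ("shalom"), a space, and a placeholder address.
string input = "\u05E9\u05DC\u05D5\u05DD user@example.com";

// Plain delimiter: the Hebrew letter, the space and the Latin letter are all consumed.
string[] consumed = Regex.Split(input, @"\p{IsHebrew} \P{IsHebrew}");

// Capture groups: the captured characters come back as separate array elements.
string[] captured = Regex.Split(input, @"(\p{IsHebrew}) (\P{IsHebrew})");

// Lookarounds: only the space is consumed; the neighbours are merely asserted.
string[] lookaround = Regex.Split(input, @"(?<=\p{IsHebrew}) (?=\P{IsHebrew})");

Console.WriteLine(consumed.Length);   // 2
Console.WriteLine(captured.Length);   // 4
Console.WriteLine(lookaround.Length); // 2
Console.WriteLine(lookaround[1]);     // user@example.com
```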

Upvotes: 0

Vladi Pavelka

Reputation: 926

Try this:

string[] split = Regex.Split(str, @"(?<=[א-ת]+) (?=[A-Za-z]+)");

?<= - lookbehind - Asserts what immediately PRECEDES the current position

?= - lookahead - Asserts what immediately FOLLOWS the current position

This places the split point at the space that sits between a Hebrew character and a Latin character.
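
The expected output in the question keeps the trailing space on the Hebrew part. One possible variation (an assumption, not part of the answer as posted) is to move the space into the lookbehind so the split is zero-width and nothing is consumed. The sketch below uses a hypothetical input with the placeholder address user@example.com, writes the Hebrew range as \u05D0-\u05EA escapes, and assumes C# 9+ top-level statements:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: one Hebrew word ("shalom"), a space, and a placeholder address.
string input = "\u05E9\u05DC\u05D5\u05DD user@example.com";

// Zero-width split: a Hebrew letter plus a space must precede the position,
// and a Latin letter must follow it. No character is removed.
string[] split = Regex.Split(input, @"(?<=[\u05D0-\u05EA] )(?=[A-Za-z])");

Console.WriteLine(split.Length);    // 2
Console.WriteLine(split[0].Length); // 5 (four Hebrew letters plus the space)
Console.WriteLine(split[1]);        // user@example.com
```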

Upvotes: 1

Nhan Phan

Reputation: 1302

Judging by your input string, we can treat it as Hebrew text followed by an email address at the end of the string.

Then the regex could be (just an example):

\w*@gmail\.com$

You can test the regex here: https://regexr.com/
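
A rough sketch of this idea in C#: match an address anchored at the end of the string and treat everything before it as the Hebrew part. The input, the placeholder address user@example.com and the simplified ASCII-only address pattern are assumptions for illustration (not a full email validator), and the code assumes C# 9+ top-level statements:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: one Hebrew word ("shalom"), a space, and a placeholder address.
string input = "\u05E9\u05DC\u05D5\u05DD user@example.com";

// ASCII-only character classes, because \w in .NET also matches Hebrew letters.
Match email = Regex.Match(input, @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$");

// Everything before the match is the leading text; fall back to the whole input if nothing matched.
string head = email.Success ? input.Substring(0, email.Index) : input;
string tail = email.Success ? email.Value : string.Empty;

Console.WriteLine(head.Length); // 5
Console.WriteLine(tail);        // user@example.com
```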

Upvotes: 0

Antoine V

Reputation: 7204

Why don't you think differently? The question here is: How to get the emails from the text.

There are a lot of posts on this question.

For example, this:

public static void emas(string text)
{
    const string MatchEmailPattern =
        @"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
        + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
        + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
        + @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
    Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    // Find matches.
    MatchCollection matches = rx.Matches(text);
    // Count the matches found.
    int noOfMatches = matches.Count;
    // Print each match.
    foreach (Match match in matches)
    {
        Console.WriteLine(match.Value);
    }
}

Upvotes: 0
