Jamie Rees

Reputation: 8183

Email address splitting

So I have a string that I need to split on semicolons.

Email address: "one@tw;,.'o"@hotmail.com;"some;thing"@example.com

Both of the email addresses are valid.

So I want to have a List<string> of the following:

"one@tw;,.'o"@hotmail.com
"some;thing"@example.com

But the way I am currently splitting the addresses is not working:

var addresses = emailAddressString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(x => x.Trim()).ToList();

Because of the multiple ; characters I end up with invalid email addresses.
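
To make the failure concrete, here is a small sketch of the same Split call wrapped in a helper. The NaiveSplitDemo class and the NaiveSplit name are made up for this illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NaiveSplitDemo
{
    // Same approach as above: split on every ';', whether or not it is inside quotes.
    public static List<string> NaiveSplit(string emailAddressString)
    {
        return emailAddressString
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(x => x.Trim())
            .ToList();
    }

    public static void Main()
    {
        var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com";
        foreach (var piece in NaiveSplit(input))
        {
            Console.WriteLine(piece);
        }
    }
}
```

The two addresses come back as four fragments ("one@tw, then ,.'o"@hotmail.com, then "some, then thing"@example.com), because the semicolons inside the quoted parts are treated as separators too.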

I have tried a few different ways, even going down the route of checking whether the string contains quotes, then finding the index of each ; character and working it out from there, but it's a real pain.

Does anyone have any better suggestions?

Upvotes: 11

Views: 3501

Answers (3)

Sergey Kalinichenko

Reputation: 726809

Assuming that double-quotes are not allowed, except for the opening and closing quotes ahead of the "at" sign @, you can use this regular expression to capture e-mail addresses:

((?:[^@"]+|"[^"]*")@[^;]+)(?:;|$)

The idea is to capture either an unquoted [^@"]+ or a quoted "[^"]*" part prior to @, and then capture everything up to semicolon ; or the end anchor $.

var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world";
var mm = Regex.Matches(input, "((?:[^@\"]+|\"[^\"]*\")@[^;]+)(?:;|$)");
foreach (Match m in mm) {
    Console.WriteLine(m.Groups[1].Value);
}

This code prints

"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world

If you would like to allow escaped double-quotes inside double-quotes, you could use a more complex expression:

((?:(?:[^@\"]|(?<=\\)\")+|\"([^\"]|(?<=\\)\")*\")@[^;]+)(?:;|$)

Everything else remains the same.
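
As a rough sketch of how the second expression could be used, the snippet below wraps it in a helper and feeds it an address whose quoted part contains an escaped quote. The EscapedQuoteDemo class, the ExtractAddresses name, and the sample input are assumptions for illustration. The pattern is written as a verbatim string, so each "" is one literal quote and \\ reaches the regex engine as an escaped backslash.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EscapedQuoteDemo
{
    // The escaped-quote expression from above, as a C# verbatim string.
    private const string Pattern =
        @"((?:(?:[^@""]|(?<=\\)"")+|""([^""]|(?<=\\)"")*"")@[^;]+)(?:;|$)";

    public static List<string> ExtractAddresses(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            // Group 1 is the whole address without the trailing semicolon.
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Actual text: "so\"me;thing"@example.com;hello@world
        var input = @"""so\""me;thing""@example.com;hello@world";
        foreach (var address in ExtractAddresses(input))
        {
            Console.WriteLine(address);
        }
    }
}
```

With this input the helper should return two entries: "so\"me;thing"@example.com and hello@world. The quote preceded by a backslash is consumed as part of the quoted section instead of closing it.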

Upvotes: 14

juharr

Reputation: 32296

You can also do this without using regular expressions. The following extension method lets you specify a delimiter character and a character that begins and ends an escape sequence (here, the double quote). Note that it does not validate that all escape sequences are closed.

public static IEnumerable<string> SpecialSplit(
    this string str, char delimiter, char beginEndEscape)
{
    int beginIndex = 0;
    int length = 0;
    bool escaped = false;
    foreach (char c in str)
    {
        if (c == beginEndEscape)
        {
            escaped = !escaped;
        }
            
        if (!escaped && c == delimiter)
        {
            yield return str.Substring(beginIndex, length);
            beginIndex += length + 1;
            length = 0;
            continue;
        }

        length++;
    }

    yield return str.Substring(beginIndex, length);
}

Then the following

var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;\"D;D@blah;blah.com\"";
foreach (var address in input.SpecialSplit(';', '"'))
    Console.WriteLine(address);

Will give this output

"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
"D;D@blah;blah.com"

Here's a version that also supports a single escape character. It assumes that two consecutive escape characters should become one literal escape character. The escape character can precede the beginEndEscape character, so that it does not begin or end an escape sequence, or the delimiter, so that it does not cause a split. Any other character that follows the escape character is kept as is, with the escape character removed.

public static IEnumerable<string> SpecialSplit(
    this string str, char delimiter, char beginEndEscape, char singleEscape)
{
    StringBuilder builder = new StringBuilder();
    bool escapedSequence = false;
    bool previousEscapeChar = false;
    foreach (char c in str)
    {
        if (c == singleEscape && !previousEscapeChar)
        {
            previousEscapeChar = true;
            continue;
        }

        if (c == beginEndEscape && !previousEscapeChar)
        {
            escapedSequence = !escapedSequence;
        }

        if (!escapedSequence && !previousEscapeChar && c == delimiter)
        {
            yield return builder.ToString();
            builder.Clear();
            continue;
        }

        builder.Append(c);
        previousEscapeChar = false;
    }

    yield return builder.ToString();
}

Finally, you should probably add null checking for the string that is passed in. Note that both versions return a sequence containing one empty string if you pass in an empty string.
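
One possible way to add the null check is sketched below. Because a method containing yield return does not run any of its body until the result is enumerated, the check sits in a separate non-iterator wrapper so that it throws at the call site. The names SplitExtensions, SpecialSplitChecked and SpecialSplitIterator are hypothetical, and the iterator body is the first version from above, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitExtensions
{
    // Non-iterator wrapper: the argument is validated as soon as the method is called.
    public static IEnumerable<string> SpecialSplitChecked(
        this string str, char delimiter, char beginEndEscape)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        return SpecialSplitIterator(str, delimiter, beginEndEscape);
    }

    // Same logic as the first SpecialSplit above.
    private static IEnumerable<string> SpecialSplitIterator(
        string str, char delimiter, char beginEndEscape)
    {
        int beginIndex = 0;
        int length = 0;
        bool escaped = false;
        foreach (char c in str)
        {
            if (c == beginEndEscape)
            {
                escaped = !escaped;
            }

            if (!escaped && c == delimiter)
            {
                yield return str.Substring(beginIndex, length);
                beginIndex += length + 1;
                length = 0;
                continue;
            }

            length++;
        }

        yield return str.Substring(beginIndex, length);
    }

    public static void Main()
    {
        // An empty input still yields one empty string; filter it out if that is unwanted.
        var parts = "".SpecialSplitChecked(';', '"').Where(s => s.Length > 0).ToList();
        Console.WriteLine(parts.Count); // 0
    }
}
```

Whether empty entries should be dropped is a design choice. The Where filter in Main mirrors the RemoveEmptyEntries option used in the question.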

Upvotes: 3

Darren Gourley

Reputation: 1808

I evidently started writing my non-regex method at around the same time as juharr (see the other answer). Since I already had it written, I thought I would submit it.

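    public static IEnumerable<string> SplitEmailsByDelimiter(string input, char delimiter)
    {
        var startIndex = 0;
        var delimiterIndex = 0;

        while (delimiterIndex >= 0)
        {
            // Use the delimiter argument rather than a hard-coded ';'.
            delimiterIndex = input.IndexOf(delimiter, startIndex);

            // Candidate: everything before the delimiter, or the whole remainder if none is left.
            string substring = input;
            if (delimiterIndex >= 0)
            {
                substring = input.Substring(0, delimiterIndex);
            }

            // Accept the candidate when it contains no quotes or at least two of them,
            // i.e. the delimiter does not sit inside an open quoted section.
            // Note: a trailing delimiter produces a final empty string.
            if (!substring.Contains("\"") || substring.IndexOf("\"") != substring.LastIndexOf("\""))
            {
                yield return substring;
                input = input.Substring(delimiterIndex + 1);
                startIndex = 0;
            }
            else
            {
                // The delimiter is inside quotes; search again from the next position.
                startIndex = delimiterIndex + 1;
            }
        }
    }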

Then the following

            var input = "[email protected];\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;[email protected];";
            foreach (var email in SplitEmailsByDelimiter(input, ';'))
            {
                Console.WriteLine(email);
            }

Would give this output

[email protected]
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
[email protected]

Upvotes: 5
