Jason Miesionczek
Jason Miesionczek

Reputation: 14448

Best way to parse string of email addresses

So i am working with some email header data, and for the to:, from:, cc:, and bcc: fields the email address(es) can be expressed in a number of different ways:

First Last <[email protected]>
Last, First <[email protected]>
[email protected]

And these variations can appear in the same message, in any order, all in one comma separated string:

First, Last <[email protected]>, [email protected], First Last <[email protected]>

I've been trying to come up with a way to parse this string into separate First Name, Last Name, E-Mail for each person (omitting the name if only an email address is provided).

Can someone suggest the best way to do this?

I've tried to Split on the commas, which would work except in the second example where the last name is placed first. I suppose this method could work, if after i split, i examine each element and see if it contains a '@' or '<'/'>', if it doesn't then it could be assumed that the next element is the first name. Is this a good way to approach this? Have i overlooked another format the address could be in?


UPDATE: Perhaps i should clarify a little, basically all i am looking to do is break up the string containing the multiple addresses into individual strings containing the address in whatever format it was sent in. I have my own methods for validating and extracting the information from an address, it was just tricky for me to figure out the best way to separate each address.

Here is the solution i came up with to accomplish this:

String str = "Last, First <[email protected]>, [email protected], First Last <[email protected]>, \"First Last\" <[email protected]>";

List<string> addresses = new List<string>();
int atIdx = 0;
int commaIdx = 0;
int lastComma = 0;
for (int c = 0; c < str.Length; c++)
{
    if (str[c] == '@')
        atIdx = c;

    if (str[c] == ',')
        commaIdx = c;

    if (commaIdx > atIdx && atIdx > 0)
    {
        string temp = str.Substring(lastComma, commaIdx - lastComma);
        addresses.Add(temp);
        lastComma = commaIdx;
        atIdx = commaIdx;
    }

    if (c == str.Length -1)
    {
        string temp = str.Substring(lastComma, str.Legth - lastComma);
        addresses.Add(temp);
    }
}

if (commaIdx < 2)
{
    // if we get here we can assume either there was no comma, or there was only one comma as part of the last, first combo
    addresses.Add(str);
}

The above code generates the individual addresses that i can process further down the line.

Upvotes: 14

Views: 15044

Answers (13)

Jakub M&#237;šek
Jakub M&#237;šek

Reputation: 1060

The clean and short solution is to use MailAddressCollection:

var collection = new MailAddressCollection();
collection.Add(addresses);

This approach parses a list of addresses separated with colon ,, and validates it according to RFC. It throws FormatException in case the addresses are invalid. As suggested in other posts, if you need to deal with invalid addresses, you have to pre-process or parse the value by yourself, otherwise recommending to use what .NET offers without using reflection.

Sample:

var collection = new MailAddressCollection();
collection.Add("Joe Doe <[email protected]>, [email protected]");

foreach (var addr in collection)
{
  // addr.DisplayName, addr.User, addr.Host
}

Upvotes: 3

Simon Green
Simon Green

Reputation: 1161

Here's what I came up with. It assumes that a valid email address must have one and only one '@' sign in it:

    public List<MailAddress> ParseAddresses(string field)
    {
        var tokens = field.Split(',');
        var addresses = new List<string>();

        var tokenBuffer = new List<string>();

        foreach (var token in tokens)
        {
            tokenBuffer.Add(token);

            if (token.IndexOf("@", StringComparison.Ordinal) > -1)
            {
                addresses.Add( string.Join( ",", tokenBuffer));
                tokenBuffer.Clear();
            }
        }

        return addresses.Select(t => new MailAddress(t)).ToList();
    }

Upvotes: 0

user7984361
user7984361

Reputation: 91

There is internal System.Net.Mail.MailAddressParser class which has method ParseMultipleAddresses which does exactly what you want. You can access it directly through reflection or by calling MailMessage.To.Add method, which accepts email list string.

private static IEnumerable<MailAddress> ParseAddress(string addresses)
{
    var mailAddressParserClass = Type.GetType("System.Net.Mail.MailAddressParser");
    var parseMultipleAddressesMethod = mailAddressParserClass.GetMethod("ParseMultipleAddresses", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
    return (IList<MailAddress>)parseMultipleAddressesMethod.Invoke(null, new object[0]);
}


    private static IEnumerable<MailAddress> ParseAddress(string addresses)
    {
        MailMessage message = new MailMessage();
        message.To.Add(addresses);
        return new List<MailAddress>(message.To); //new List, because we don't want to hold reference on Disposable object
    }

Upvotes: 9

Phil Haselden
Phil Haselden

Reputation: 3026

Your 2nd email example is not a valid address as it contains a comma which is not within a quoted string. To be valid it should be like: "Last, First"<[email protected]>.

As for parsing, if you want something that is quite strict, you could use System.Net.Mail.MailAddressCollection.

If you just want to your input split into separate email strings, then the following code should work. It is not very strict but will handle commas within quoted strings and throw an exception if the input contains an unclosed quote.

public List<string> SplitAddresses(string addresses)
{
    var result = new List<string>();

    var startIndex = 0;
    var currentIndex = 0;
    var inQuotedString = false;

    while (currentIndex < addresses.Length)
    {
        if (addresses[currentIndex] == QUOTE)
        {
            inQuotedString = !inQuotedString;
        }
        // Split if a comma is found, unless inside a quoted string
        else if (addresses[currentIndex] == COMMA && !inQuotedString)
        {
            var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
            if (address.Length > 0)
            {
                result.Add(address);
            }
            startIndex = currentIndex + 1;
        }
        currentIndex++;
    }

    if (currentIndex > startIndex)
    {
        var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
        if (address.Length > 0)
        {
            result.Add(address);
        }
    }

    if (inQuotedString)
        throw new FormatException("Unclosed quote in email addresses");

    return result;
}

private string GetAndCleanSubstring(string addresses, int startIndex, int currentIndex)
{
    var address = addresses.Substring(startIndex, currentIndex - startIndex);
    address = address.Trim();
    return address;
}

Upvotes: 4

richard
richard

Reputation: 12564

I decided that I was going to draw a line in the sand at two restrictions:

  1. The To and Cc headers have to be csv parseable strings.
  2. Anything MailAddress couldn't parse, I'm just not going to worry about it.

I also decided I'm just interested in email addresses and not display name, since display name is so problematic and hard to define, whereas email address I can validate. So I used MailAddress to validate my parsing.

I treated the To and Cc headers like a csv string, and again, anything not parseable in that way I don't worry about it.

private string GetProperlyFormattedEmailString(string emailString)
    {
        var emailStringParts = CSVProcessor.GetFieldsFromString(emailString);

        string emailStringProcessed = "";

        foreach (var part in emailStringParts)
        {
            try
            {
                var address = new MailAddress(part);
                emailStringProcessed += address.Address + ",";
            }
            catch (Exception)
            {
                //wasn't an email address
                throw;
            }
        }

        return emailStringProcessed.TrimEnd((','));
    }

EDIT

Further research has showed me that my assumptions are good. Reading through the spec RFC 2822 pretty much shows that the To, Cc, and Bcc fields are csv-parseable fields. So yeah it's hard and there are a lot of gotchas, as with any csv parsing, but if you have a reliable way to parse csv fields (which TextFieldParser in the Microsoft.VisualBasic.FileIO namespace is, and is what I used for this), then you are golden.

Edit 2

Apparently they don't need to be valid CSV strings...the quotes really mess things up. So your csv parser has to be fault tolerant. I made it try to parse the string, if it failed, it strips all quotes and tries again:

public static string[] GetFieldsFromString(string csvString)
    {
        using (var stringAsReader = new StringReader(csvString))
        {
            using (var textFieldParser = new TextFieldParser(stringAsReader))
            {
                SetUpTextFieldParser(textFieldParser, FieldType.Delimited, new[] {","}, false, true);

                try
                {
                    return textFieldParser.ReadFields();
                }
                catch (MalformedLineException ex1)
                {
                    //assume it's not parseable due to double quotes, so we strip them all out and take what we have
                    var sanitizedString = csvString.Replace("\"", "");

                    using (var sanitizedStringAsReader = new StringReader(sanitizedString))
                    {
                        using (var textFieldParser2 = new TextFieldParser(sanitizedStringAsReader))
                        {
                            SetUpTextFieldParser(textFieldParser2, FieldType.Delimited, new[] {","}, false, true);

                            try
                            {
                                return textFieldParser2.ReadFields().Select(part => part.Trim()).ToArray();
                            }
                            catch (MalformedLineException ex2)
                            {
                                return new string[] {csvString};
                            }
                        }
                    }
                }
            }
        }
    }

The one thing it won't handle is quoted accounts in an email i.e. "Monkey Header"@stupidemailaddresses.com.

And here's the test:

[Subject(typeof(CSVProcessor))]
public class when_processing_an_email_recipient_header
{
    static string recipientHeaderToParse1 = @"""Lastname, Firstname"" <[email protected]>" + "," +
                                           @"<[email protected]>, [email protected], [email protected]" + "," +
                                           @"<[email protected]>, [email protected]" + "," +
                                           @"""""Yes, this is valid""""@[emails are hard to parse!]" + "," +
                                           @"First, Last <[email protected]>, [email protected], First Last <[email protected]>"
                                           ;

    static string[] results1;
    static string[] expectedResults1;

    Establish context = () =>
    {
        expectedResults1 = new string[]
        {
            @"Lastname",
            @"Firstname <[email protected]>",
            @"<[email protected]>",
            @"[email protected]",
            @"[email protected]",
            @"<[email protected]>",
            @"[email protected]",
            @"Yes",
            @"this is valid@[emails are hard to parse!]",
            @"First",
            @"Last <[email protected]>",
            @"[email protected]",
            @"First Last <[email protected]>"
        };
    };

    Because of = () =>
    {
        results1 = CSVProcessor.GetFieldsFromString(recipientHeaderToParse1);
    };

    It should_parse_the_email_parts_properly = () => results1.ShouldBeLike(expectedResults1);
}

Upvotes: 0

Alex Yakimovich
Alex Yakimovich

Reputation: 1

I use the following regular expression in Java to get email string from RFC-compliant email address:

[A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}

Upvotes: -2

Vince
Vince

Reputation: 615

// Based on Michael Perry's answer * // needs to handle [email protected], [email protected] and related syntaxes // also looks for first and last name within those email syntaxes

public class ParsedEmail
{
    private string _first;
    private string _last;
    private string _name;
    private string _domain;

    public ParsedEmail(string first, string last, string name, string domain)
    {
        _name = name;
        _domain = domain;

        // [email protected], [email protected] etc. syntax
        char[] chars = { '.', '_', '+', '-' };
        var pos = _name.IndexOfAny(chars);

        if (string.IsNullOrWhiteSpace(_first) && string.IsNullOrWhiteSpace(_last) && pos > -1)
        {
            _first = _name.Substring(0, pos);
            _last = _name.Substring(pos+1);
        }
    }

    public string First
    {
        get { return _first; }
    }

    public string Last
    {
        get { return _last; }
    }

    public string Name
    {
        get { return _name; }
    }

    public string Domain
    {
        get { return _domain; }
    }

    public string Email
    {
        get
        {
            return Name + "@" + Domain;
        }
    }

    public override string ToString()
    {
        return Email;
    }

    public static IEnumerable<ParsedEmail> SplitEmailList(string delimList)
    {
        delimList = delimList.Replace("\"", string.Empty);

        Regex re = new Regex(
                    @"((?<last>\w*), (?<first>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
                    @"((?<first>\w*) (?<last>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
                    @"((?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*))");


        MatchCollection matches = re.Matches(delimList);

        var parsedEmails =
                   (from Match match in matches
                    select new ParsedEmail(
                            match.Groups["first"].Value,
                            match.Groups["last"].Value,
                            match.Groups["name"].Value,
                            match.Groups["domain"].Value)).ToList();

        return parsedEmails;

    }


}

Upvotes: 0

Jason Miesionczek
Jason Miesionczek

Reputation: 14448

Here is the solution i came up with to accomplish this:

String str = "Last, First <[email protected]>, [email protected], First Last <[email protected]>, \"First Last\" <[email protected]>";

List<string> addresses = new List<string>();
int atIdx = 0;
int commaIdx = 0;
int lastComma = 0;
for (int c = 0; c < str.Length; c++)
{
if (str[c] == '@')
    atIdx = c;

if (str[c] == ',')
    commaIdx = c;

if (commaIdx > atIdx && atIdx > 0)
{
    string temp = str.Substring(lastComma, commaIdx - lastComma);
    addresses.Add(temp);
    lastComma = commaIdx;
    atIdx = commaIdx;
}

if (c == str.Length -1)
{
    string temp = str.Substring(lastComma, str.Legth - lastComma);
    addresses.Add(temp);
}
}

if (commaIdx < 2)
{
    // if we get here we can assume either there was no comma, or there was only one comma as part of the last, first combo
    addresses.Add(str);
}

Upvotes: 2

Michael L Perry
Michael L Perry

Reputation: 7437

At the risk of creating two problems, you could create a regular expression that matches any of your email formats. Use "|" to separate the formats within this one regex. Then you can run it over your input string and pull out all of the matches.

public class Address
{
    private string _first;
    private string _last;
    private string _name;
    private string _domain;

    public Address(string first, string last, string name, string domain)
    {
        _first = first;
        _last = last;
        _name = name;
        _domain = domain;
    }

    public string First
    {
        get { return _first; }
    }

    public string Last
    {
        get { return _last; }
    }

    public string Name
    {
        get { return _name; }
    }

    public string Domain
    {
        get { return _domain; }
    }
}

[TestFixture]
public class RegexEmailTest
{
    [Test]
    public void TestThreeEmailAddresses()
    {
        Regex emailAddress = new Regex(
            @"((?<last>\w*), (?<first>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<first>\w*) (?<last>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<name>\w*)@(?<domain>\w*\.\w*))");
        string input = "First, Last <[email protected]>, [email protected], First Last <[email protected]>";

        MatchCollection matches = emailAddress.Matches(input);
        List<Address> addresses =
            (from Match match in matches
             select new Address(
                 match.Groups["first"].Value,
                 match.Groups["last"].Value,
                 match.Groups["name"].Value,
                 match.Groups["domain"].Value)).ToList();
        Assert.AreEqual(3, addresses.Count);

        Assert.AreEqual("Last", addresses[0].First);
        Assert.AreEqual("First", addresses[0].Last);
        Assert.AreEqual("name", addresses[0].Name);
        Assert.AreEqual("domain.com", addresses[0].Domain);

        Assert.AreEqual("", addresses[1].First);
        Assert.AreEqual("", addresses[1].Last);
        Assert.AreEqual("name", addresses[1].Name);
        Assert.AreEqual("domain.com", addresses[1].Domain);

        Assert.AreEqual("First", addresses[2].First);
        Assert.AreEqual("Last", addresses[2].Last);
        Assert.AreEqual("name", addresses[2].Name);
        Assert.AreEqual("domain.com", addresses[2].Domain);
    }
}

There are several down sides to this approach. One is that it doesn't validate the string. If you have any characters in the string that don't fit one of your chosen formats, then those characters are just ignored. Another is that the accepted formats are all expressed in one place. You cannot add new formats without changing the monolithic regex.

Upvotes: 4

Anders
Anders

Reputation: 12580

You could use regular expressions to try to separate this out, try this guy:

^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9.-_<>]+?)$

will match: Last, First [email protected]; Last, First <[email protected]>; First last [email protected]; First Last <[email protected]>. You can add another optional match in the regex at the end to pick up the last segment of First, Last <[email protected]>, [email protected] after the email address enclosed in angled braces.

Hope this helps somewhat!

EDIT:

and of course you can add more characters to each of the sections to accept quotations etc for whatever format is being read in. As sjbotha mentioned, this could be difficult as the string that is submitted is not necessarily in a set format.

This link can give you more information about matching AND validating email addresses using regular expressions.

Upvotes: 0

jle
jle

Reputation: 9489

Here is how I would do it:

  • You can try to standardize the data as much as possible i.e. get rid of such things as the < and > symbols and all of the commas after the '.com.' You will need the commas that separate the first and last names.
  • After getting rid of the extra symbols, put every grouped email record in a list as a string. You can use the .com to determine where to split the string if need be.
  • After you have the list of email addresses in the list of strings, you can then further split the email addresses using only whitespace as the delimeter.
  • The final step is to determine what is the first name, what is the last name, etc. This would be done by checking the 3 components for: a comma, which would indicate that it is the last name; a . which would indicate the actual address; and whatever is left is the first name. If there is no comma, then the first name is first, last name is second, etc.

    I don't know if this is the most concise solution, but it would work and does not require any advanced programming techniques

Upvotes: 0

Scott Dorman
Scott Dorman

Reputation: 42526

There is no generic simple solution to this. The RFC you want is RFC2822, which describes all of the possible configurations of an email address. The best you are going to get that will be correct is to implement a state-based tokenizer that follows the rules specified in the RFC.

Upvotes: 2

Sarel Botha
Sarel Botha

Reputation: 12740

There isn't really an easy solution to this. I would recommend making a little state machine that reads char-by-char and do the work that way. Like you said, splitting by comma won't always work.

A state machine will allow you to cover all possibilities. I'm sure there are many others you haven't seen yet. For example: "First Last"

Look for the RFC about this to discover what all the possibilities are. Sorry, I don't know the number. There are probably multiple as this is the kind of things that evolves.

Upvotes: 4

Related Questions