Reputation: 14448
I am working with some email header data, and for the to:, from:, cc:, and bcc: fields the email address(es) can be expressed in a number of different ways:
First Last <name@domain.com>
Last, First <name@domain.com>
name@domain.com
And these variations can appear in the same message, in any order, all in one comma separated string:
First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>
I've been trying to come up with a way to parse this string into separate First Name, Last Name, E-Mail for each person (omitting the name if only an email address is provided).
Can someone suggest the best way to do this?
I've tried to Split on the commas, which works except in the second example, where the last name is placed first. I suppose this method could work if, after splitting, I examine each element to see whether it contains a '@' or '<'/'>'; if it doesn't, it can be assumed that the next element is the first name. Is this a good way to approach this? Have I overlooked another format the address could be in?
UPDATE: Perhaps I should clarify a little. All I am looking to do is break up the string containing the multiple addresses into individual strings, each containing one address in whatever format it was sent in. I have my own methods for validating and extracting the information from an address; the tricky part was figuring out the best way to separate each address.
Here is the solution I came up with to accomplish this:
string str = "Last, First <name@domain.com>, name@domain.com, First Last <name@domain.com>, \"First Last\" <name@domain.com>";
List<string> addresses = new List<string>();
int atIdx = -1; // index of the most recent '@' in the current address, -1 if none seen yet
int start = 0;  // index where the current address begins
for (int c = 0; c < str.Length; c++)
{
    if (str[c] == '@')
        atIdx = c;
    // A comma that follows an '@' ends the current address;
    // a comma before any '@' is assumed to be part of a "Last, First" name.
    if (str[c] == ',' && atIdx >= start)
    {
        addresses.Add(str.Substring(start, c - start).Trim());
        start = c + 1; // skip the separating comma
        atIdx = -1;
    }
}
// Whatever remains after the last separating comma is the final address.
// This also covers input with no comma, or with only the comma of a "Last, First" name.
if (start < str.Length)
{
    string last = str.Substring(start).Trim();
    if (last.Length > 0)
        addresses.Add(last);
}
The above code generates the individual addresses that I can process further down the line.
Upvotes: 14
Views: 15044
Reputation: 1060
The clean and short solution is to use MailAddressCollection:
var collection = new MailAddressCollection();
collection.Add(addresses);
This approach parses a comma-separated list of addresses and validates it according to the RFC. It throws FormatException if any of the addresses is invalid. As suggested in other posts, if you need to deal with invalid addresses, you have to pre-process or parse the value yourself; otherwise I recommend using what .NET offers, without resorting to reflection.
var collection = new MailAddressCollection();
collection.Add("Joe Doe <[email protected]>, [email protected]");
foreach (var addr in collection)
{
// addr.DisplayName, addr.User, addr.Host
}
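If invalid entries need to be tolerated rather than aborting the whole list, one option is to validate each already-separated candidate on its own. The following is only a minimal sketch under that assumption: `AddressValidation` and `ParseValid` are hypothetical names, and the input is assumed to have been split into individual address strings beforehand (for example by one of the splitters in the other answers).

```csharp
using System;
using System.Collections.Generic;
using System.Net.Mail;

public static class AddressValidation
{
    // Hypothetical helper: parses each candidate separately so that one bad
    // entry does not prevent the others from being accepted.
    public static List<MailAddress> ParseValid(IEnumerable<string> candidates, List<string> rejected)
    {
        var valid = new List<MailAddress>();
        foreach (string candidate in candidates)
        {
            // MailAddress rejects null/empty input with a different exception type,
            // so filter those out before calling the constructor.
            if (string.IsNullOrWhiteSpace(candidate))
            {
                rejected.Add(candidate);
                continue;
            }
            try
            {
                valid.Add(new MailAddress(candidate.Trim()));
            }
            catch (FormatException)
            {
                // Not parseable as an address; keep it for reporting.
                rejected.Add(candidate);
            }
        }
        return valid;
    }
}
```

The caller passes an empty list for `rejected` and can inspect it afterwards to log or report the entries that were skipped.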
Upvotes: 3
Reputation: 1161
Here's what I came up with. It assumes that a valid email address must have one and only one '@' sign in it:
public List<MailAddress> ParseAddresses(string field)
{
var tokens = field.Split(',');
var addresses = new List<string>();
var tokenBuffer = new List<string>();
foreach (var token in tokens)
{
tokenBuffer.Add(token);
if (token.IndexOf("@", StringComparison.Ordinal) > -1)
{
addresses.Add( string.Join( ",", tokenBuffer));
tokenBuffer.Clear();
}
}
return addresses.Select(t => new MailAddress(t)).ToList();
}
Upvotes: 0
Reputation: 91
There is an internal System.Net.Mail.MailAddressParser class with a ParseMultipleAddresses method that does exactly what you want. You can access it directly through reflection, or indirectly by calling the MailMessage.To.Add method, which accepts an email list string.
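// Option 1: call the internal parser through reflection.
// MailAddressParser is not public API, so its name and signature may differ between framework versions.
private static IEnumerable<MailAddress> ParseAddressViaReflection(string addresses)
{
    // Look the type up in the assembly that contains MailAddress; a bare Type.GetType would not find it.
    var mailAddressParserClass = typeof(MailAddress).Assembly.GetType("System.Net.Mail.MailAddressParser");
    var parseMultipleAddressesMethod = mailAddressParserClass.GetMethod("ParseMultipleAddresses", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
    // The address list string must be passed as the method's argument.
    return (IEnumerable<MailAddress>)parseMultipleAddressesMethod.Invoke(null, new object[] { addresses });
}
// Option 2: let MailMessage.To do the parsing.
private static IEnumerable<MailAddress> ParseAddress(string addresses)
{
    using (var message = new MailMessage())
    {
        message.To.Add(addresses);
        // Copy into a new list so we don't hold a reference to the disposable MailMessage.
        return new List<MailAddress>(message.To);
    }
}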
Upvotes: 9
Reputation: 3026
Your second email example is not a valid address, because it contains a comma that is not inside a quoted string. To be valid it should look like: "Last, First"<name@domain.com>.
As for parsing, if you want something quite strict, you could use System.Net.Mail.MailAddressCollection.
If you just want your input split into separate email strings, then the following code should work. It is not very strict, but it handles commas within quoted strings and throws an exception if the input contains an unclosed quote.
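// Delimiter characters used by SplitAddresses.
private const char QUOTE = '"';
private const char COMMA = ',';
public List<string> SplitAddresses(string addresses)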
{
var result = new List<string>();
var startIndex = 0;
var currentIndex = 0;
var inQuotedString = false;
while (currentIndex < addresses.Length)
{
if (addresses[currentIndex] == QUOTE)
{
inQuotedString = !inQuotedString;
}
// Split if a comma is found, unless inside a quoted string
else if (addresses[currentIndex] == COMMA && !inQuotedString)
{
var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
if (address.Length > 0)
{
result.Add(address);
}
startIndex = currentIndex + 1;
}
currentIndex++;
}
if (currentIndex > startIndex)
{
var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
if (address.Length > 0)
{
result.Add(address);
}
}
if (inQuotedString)
throw new FormatException("Unclosed quote in email addresses");
return result;
}
private string GetAndCleanSubstring(string addresses, int startIndex, int currentIndex)
{
var address = addresses.Substring(startIndex, currentIndex - startIndex);
address = address.Trim();
return address;
}
Upvotes: 4
Reputation: 12564
I decided that I was going to draw a line in the sand at two restrictions:
I also decided I'm just interested in email addresses and not display name, since display name is so problematic and hard to define, whereas email address I can validate. So I used MailAddress to validate my parsing.
I treated the To and Cc headers like a csv string, and again, anything not parseable in that way I don't worry about it.
private string GetProperlyFormattedEmailString(string emailString)
{
var emailStringParts = CSVProcessor.GetFieldsFromString(emailString);
string emailStringProcessed = "";
foreach (var part in emailStringParts)
{
try
{
var address = new MailAddress(part);
emailStringProcessed += address.Address + ",";
}
catch (Exception)
{
//wasn't an email address
throw;
}
}
return emailStringProcessed.TrimEnd((','));
}
EDIT
Further research has shown me that my assumptions are good. Reading through RFC 2822 pretty much shows that the To, Cc, and Bcc fields are CSV-parseable fields. So yes, it's hard and there are a lot of gotchas, as with any CSV parsing, but if you have a reliable way to parse CSV fields (which TextFieldParser in the Microsoft.VisualBasic.FileIO namespace is, and is what I used for this), then you are golden.
Edit 2
Apparently they don't need to be valid CSV strings...the quotes really mess things up. So your csv parser has to be fault tolerant. I made it try to parse the string, if it failed, it strips all quotes and tries again:
public static string[] GetFieldsFromString(string csvString)
{
using (var stringAsReader = new StringReader(csvString))
{
using (var textFieldParser = new TextFieldParser(stringAsReader))
{
SetUpTextFieldParser(textFieldParser, FieldType.Delimited, new[] {","}, false, true);
try
{
return textFieldParser.ReadFields();
}
catch (MalformedLineException ex1)
{
//assume it's not parseable due to double quotes, so we strip them all out and take what we have
var sanitizedString = csvString.Replace("\"", "");
using (var sanitizedStringAsReader = new StringReader(sanitizedString))
{
using (var textFieldParser2 = new TextFieldParser(sanitizedStringAsReader))
{
SetUpTextFieldParser(textFieldParser2, FieldType.Delimited, new[] {","}, false, true);
try
{
return textFieldParser2.ReadFields().Select(part => part.Trim()).ToArray();
}
catch (MalformedLineException ex2)
{
return new string[] {csvString};
}
}
}
}
}
}
}
The one thing it won't handle is quoted accounts in an email i.e. "Monkey Header"@stupidemailaddresses.com.
And here's the test:
[Subject(typeof(CSVProcessor))]
public class when_processing_an_email_recipient_header
{
static string recipientHeaderToParse1 = @"""Lastname, Firstname"" <[email protected]>" + "," +
@"<[email protected]>, [email protected], [email protected]" + "," +
@"<[email protected]>, [email protected]" + "," +
@"""""Yes, this is valid""""@[emails are hard to parse!]" + "," +
@"First, Last <[email protected]>, [email protected], First Last <[email protected]>"
;
static string[] results1;
static string[] expectedResults1;
Establish context = () =>
{
expectedResults1 = new string[]
{
@"Lastname",
@"Firstname <[email protected]>",
@"<[email protected]>",
@"[email protected]",
@"[email protected]",
@"<[email protected]>",
@"[email protected]",
@"Yes",
@"this is valid@[emails are hard to parse!]",
@"First",
@"Last <[email protected]>",
@"[email protected]",
@"First Last <[email protected]>"
};
};
Because of = () =>
{
results1 = CSVProcessor.GetFieldsFromString(recipientHeaderToParse1);
};
It should_parse_the_email_parts_properly = () => results1.ShouldBeLike(expectedResults1);
}
Upvotes: 0
Reputation: 1
I use the following regular expression in Java to extract the email address string from an RFC-compliant email address:
[A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}
Upvotes: -2
Reputation: 615
// Based on Michael Perry's answer
// needs to handle [email protected], [email protected] and related syntaxes
// also looks for first and last name within those email syntaxes
public class ParsedEmail
{
private string _first;
private string _last;
private string _name;
private string _domain;
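public ParsedEmail(string first, string last, string name, string domain)
{
// Keep the names captured from the display part, if any; otherwise the fallback below always overwrites them
_first = first;
_last = last;
_name = name;
_domain = domain;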
// [email protected], [email protected] etc. syntax
char[] chars = { '.', '_', '+', '-' };
var pos = _name.IndexOfAny(chars);
if (string.IsNullOrWhiteSpace(_first) && string.IsNullOrWhiteSpace(_last) && pos > -1)
{
_first = _name.Substring(0, pos);
_last = _name.Substring(pos+1);
}
}
public string First
{
get { return _first; }
}
public string Last
{
get { return _last; }
}
public string Name
{
get { return _name; }
}
public string Domain
{
get { return _domain; }
}
public string Email
{
get
{
return Name + "@" + Domain;
}
}
public override string ToString()
{
return Email;
}
public static IEnumerable<ParsedEmail> SplitEmailList(string delimList)
{
delimList = delimList.Replace("\"", string.Empty);
Regex re = new Regex(
@"((?<last>\w*), (?<first>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
@"((?<first>\w*) (?<last>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
@"((?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*))");
MatchCollection matches = re.Matches(delimList);
var parsedEmails =
(from Match match in matches
select new ParsedEmail(
match.Groups["first"].Value,
match.Groups["last"].Value,
match.Groups["name"].Value,
match.Groups["domain"].Value)).ToList();
return parsedEmails;
}
}
Upvotes: 0
Reputation: 7437
At the risk of creating two problems, you could create a regular expression that matches any of your email formats. Use "|" to separate the formats within this one regex. Then you can run it over your input string and pull out all of the matches.
public class Address
{
private string _first;
private string _last;
private string _name;
private string _domain;
public Address(string first, string last, string name, string domain)
{
_first = first;
_last = last;
_name = name;
_domain = domain;
}
public string First
{
get { return _first; }
}
public string Last
{
get { return _last; }
}
public string Name
{
get { return _name; }
}
public string Domain
{
get { return _domain; }
}
}
[TestFixture]
public class RegexEmailTest
{
[Test]
public void TestThreeEmailAddresses()
{
Regex emailAddress = new Regex(
@"((?<last>\w*), (?<first>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
@"((?<first>\w*) (?<last>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
@"((?<name>\w*)@(?<domain>\w*\.\w*))");
string input = "First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>";
MatchCollection matches = emailAddress.Matches(input);
List<Address> addresses =
(from Match match in matches
select new Address(
match.Groups["first"].Value,
match.Groups["last"].Value,
match.Groups["name"].Value,
match.Groups["domain"].Value)).ToList();
Assert.AreEqual(3, addresses.Count);
Assert.AreEqual("Last", addresses[0].First);
Assert.AreEqual("First", addresses[0].Last);
Assert.AreEqual("name", addresses[0].Name);
Assert.AreEqual("domain.com", addresses[0].Domain);
Assert.AreEqual("", addresses[1].First);
Assert.AreEqual("", addresses[1].Last);
Assert.AreEqual("name", addresses[1].Name);
Assert.AreEqual("domain.com", addresses[1].Domain);
Assert.AreEqual("First", addresses[2].First);
Assert.AreEqual("Last", addresses[2].Last);
Assert.AreEqual("name", addresses[2].Name);
Assert.AreEqual("domain.com", addresses[2].Domain);
}
}
There are several downsides to this approach. One is that it doesn't validate the string: any characters that don't fit one of your chosen formats are silently ignored. Another is that the accepted formats are all expressed in one place, so you cannot add new formats without changing the monolithic regex.
Upvotes: 4
Reputation: 12580
You could use regular expressions to try to separate this out. Try this one:
^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9.-_<>]+?)$
It will match: Last, First name@domain.com; Last, First <name@domain.com>; First last name@domain.com; First Last <name@domain.com>. You can add another optional match at the end of the regex to pick up the last segment of First, Last <name@domain.com>, name@domain.com after the email address enclosed in angle brackets.
Hope this helps somewhat!
EDIT:
Of course, you can add more characters to each of the sections to accept quotation marks etc. for whatever format is being read in. As sjbotha mentioned, this could be difficult, as the submitted string is not necessarily in a set format.
This link can give you more information about matching AND validating email addresses using regular expressions.
Upvotes: 0
Reputation: 9489
Here is how I would do it:
Upvotes: 0
Reputation: 42526
There is no simple generic solution to this. The RFC you want is RFC 2822, which describes all of the possible configurations of an email address. The best you are going to get that will be correct is to implement a state-based tokenizer that follows the rules specified in the RFC.
Upvotes: 2
Reputation: 12740
There isn't really an easy solution to this. I would recommend making a little state machine that reads char-by-char and does the work that way. Like you said, splitting by comma won't always work.
A state machine will allow you to cover all possibilities. I'm sure there are many others you haven't seen yet. For example: "First Last"
Look for the RFC about this to discover what all the possibilities are. Sorry, I don't know the number. There are probably several, as this is the kind of thing that evolves.
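As a rough illustration of the char-by-char idea, here is a minimal sketch of such a state machine. It is not a full RFC implementation: `AddressListSplitter` is a hypothetical name, nested comments and group syntax are ignored, and treating a comma that appears before any '@' as part of an unquoted "Last, First" name is a leniency added for the question's input rather than something the RFC allows.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AddressListSplitter
{
    private enum State { Normal, InQuotes, InAngle, InComment }

    // Splits a header value into individual address strings.
    // Commas inside "quoted strings", <angle brackets> and (comments) never split.
    public static List<string> Split(string header)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        var state = State.Normal;
        bool escaped = false; // previous char was a backslash inside quotes
        bool sawAt = false;   // an '@' was seen outside quotes in the current address

        foreach (char ch in header)
        {
            if (state == State.Normal && ch == ',' && sawAt)
            {
                // Separator: the current address is complete.
                Flush(result, current);
                sawAt = false;
                continue;
            }

            current.Append(ch);
            switch (state)
            {
                case State.InQuotes:
                    if (escaped) escaped = false;
                    else if (ch == '\\') escaped = true;
                    else if (ch == '"') state = State.Normal;
                    break;
                case State.InAngle:
                    if (ch == '@') sawAt = true;
                    else if (ch == '>') state = State.Normal;
                    break;
                case State.InComment:
                    if (ch == ')') state = State.Normal;
                    break;
                default:
                    if (ch == '"') state = State.InQuotes;
                    else if (ch == '<') state = State.InAngle;
                    else if (ch == '(') state = State.InComment;
                    else if (ch == '@') sawAt = true;
                    break;
            }
        }
        Flush(result, current);
        return result;
    }

    private static void Flush(List<string> result, StringBuilder current)
    {
        string address = current.ToString().Trim();
        if (address.Length > 0) result.Add(address);
        current.Clear();
    }
}
```

Calling `AddressListSplitter.Split` on a header yields one string per address, which can then be handed to whatever per-address validation is already in place. Keeping the states explicit makes it straightforward to add further cases later.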
Upvotes: 4