Benny Bottema

Reputation: 11503

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to handle a somewhat free-format list of delimited email addresses. Note that I'm specifically not validating, just extracting the individual addresses from the list. For this use case the addresses can be assumed to be valid.

Here is an example of a valid input:

"[email protected],Sixpack, Joe 1 <[email protected]>, Sixpack, Joe 2 <[email protected]> ;Sixpack, Joe, 3<[email protected]> , [email protected],[email protected];[email protected];"

So there are two basic forms, "[email protected]" and "Joe Sixpack <[email protected]>", which can appear in a comma- or semicolon-delimited string, ignoring whitespace padding. The problem is that the names can contain the delimiters as valid characters.

The following array shows the data needed (trailing spaces or delimiters would not be a big problem):

["[email protected]",
"Sixpack, Joe 1 <[email protected]>",
"Sixpack, Joe 2 <[email protected]>",
"Sixpack, Joe, 3<[email protected]>",
"[email protected]",
"[email protected]",
"[email protected]"]

I can't think of a clean way to deal with this. Any suggestions on how I can reliably recognize whether a comma is part of a name or a delimiter?


Final solution (variation on the accepted answer):

var string = "[email protected],Sixpack, Joe 1 <[email protected]>, Sixpack, Joe 2 <[email protected]> ;Sixpack, Joe, 3<[email protected]> , [email protected],[email protected];[email protected];"

// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
  .replace(/(@.*?>?)\s*[,;]/g, "$1<|>")
  .replace(/<\|>$/,"") // remove trailing delimiter
  .split(/\s*<\|>\s*/) // split on the temporary delimiter, including surrounding whitespace

console.log(result)

Or in Java:

public static String[] extractEmailAddresses(String emailAddressList) {
    return emailAddressList
            .replaceAll("(@.*?>?)\\s*[,;]", "$1<|>")
            .replaceAll("<\\|>$", "")
            .split("\\s*<\\|>\\s*");
}

Upvotes: 1

Views: 1364

Answers (3)

linden2015

Reputation: 887

This pattern works for your provided examples:

([^@,;\s]+@[^@,;\s]+)|(?:^|\s*[,;])(?:\s*)(.*?)<([^@,;\s]+@[^@,;\s]+)>

([^@,;\s]+@[^@,;\s]+)   # email defined by an @ with connected chars except ',' ';' and white-space
|                       # OR
(?:^|\s*[,;])(?:\s*)    # start of line OR 0 or more white-space chars followed by a separator, then 0 or more white-space chars
(.*?)                   # name
<([^@,;\s]+@[^@,;\s]+)> # email enclosed by lt-gt

PCRE Demo
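
As a rough sketch of how the capture groups could be consumed in JavaScript: the sample addresses below are placeholders (not taken from the question), `extractWithGroups` is a hypothetical helper name, and the pattern is written with `^` for the start-of-input anchor that the breakdown describes.

```javascript
// Hypothetical sample input; the addresses are placeholders.
const input =
  "joe0@example.com,Sixpack, Joe 1 <joe1@example.com>, Sixpack, Joe 2 <joe2@example.com> ;" +
  "Sixpack, Joe, 3<joe3@example.com> , joe4@example.com,joe5@example.com;joe6@example.com;";

// The pattern from this answer, with ^ as the start-of-input anchor.
const pattern =
  /([^@,;\s]+@[^@,;\s]+)|(?:^|\s*[,;])(?:\s*)(.*?)<([^@,;\s]+@[^@,;\s]+)>/g;

function extractWithGroups(list) {
  const entries = [];
  for (const m of list.matchAll(pattern)) {
    if (m[1] !== undefined) {
      // Bare address: group 1 holds the whole entry.
      entries.push(m[1]);
    } else {
      // Named address: group 2 is the display name, group 3 the address.
      entries.push(`${m[2].trim()} <${m[3]}>`);
    }
  }
  return entries;
}

console.log(extractWithGroups(input));
```

One caveat worth checking against real data: the lazy name group `(.*?)` accepts any characters, so a bare address that sits between a delimiter and a later `<...>` entry (for example `a@x.com, b@y.com, Name <c@z.com>`) ends up inside the captured name. The sketch is only exercised against the ordering used in the sample.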

Upvotes: 2

Tezra

Reputation: 8833

Using Java's replaceAll and split functions (mimicked in JavaScript below), I would lock onto what you know ends an item (the ".com"), replace the separator characters that follow it with a unique temporary token (a UUID or something like <|>), and then split on that token.

Here is a JavaScript example, but Java's replaceAll and split can do the same job.

var string = "[email protected],Joe Sixpack <[email protected]>, Sixpack, Joe <[email protected]> ;Sixpack, Joe<[email protected]> , [email protected],[email protected];[email protected];"


const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)
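
Anchoring on a literal ".com" only holds while every address ends in that suffix. A small sketch of the difference, using placeholder addresses and two hypothetical helper names (`splitOnDotCom`, `splitOnAt`); the second helper uses the "@"-anchored pattern from the asker's final solution:

```javascript
// Hypothetical input mixing top-level domains; the addresses are placeholders.
const mixed = "joe0@example.org, Sixpack, Joe 1 <joe1@example.com>; joe2@example.net;";

// Anchors on the literal ".com", as in the snippet above.
function splitOnDotCom(list) {
  return list
    .replace(/(\.com>?)[\s,;]+/g, "$1<|>")
    .replace(/<\|>$/, "")
    .split("<|>");
}

// Anchors on "@" up to the next delimiter, as in the asker's final solution.
function splitOnAt(list) {
  return list
    .replace(/(@.*?>?)\s*[,;]/g, "$1<|>")
    .replace(/<\|>$/, "")
    .split(/\s*<\|>\s*/);
}

console.log(splitOnDotCom(mixed)); // the .org entry stays glued to the named entry
console.log(splitOnAt(mixed));     // three separate entries
```

The "@" anchor relies on the question's assumption that every entry contains exactly one address and that the addresses themselves contain no commas or semicolons.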

Upvotes: 1

Bamieh

Reputation: 10916

Since you are not validating, I assume that the email addresses are valid. Based on this assumption, I look for an email address followed by ; or , so that I know each match is a complete entry.

    var string = "[email protected],Sixpack, Joe 1 <[email protected]>, Sixpack, Joe 2 <[email protected]> ;Sixpack, Joe, 3<[email protected]> , [email protected],[email protected];[email protected];"



    const result = string.match(/(.*?@.*?\..*?)[,;]/g)
    console.log(result)
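
With the `g` flag, `match` returns the full matches rather than the capture group, so each entry still carries its trailing delimiter and any leading whitespace. A minimal sketch of a cleanup step, assuming placeholder addresses and a hypothetical helper name `matchEntries`:

```javascript
// Hypothetical sample input; the addresses are placeholders.
const input =
  "joe0@example.com,Sixpack, Joe 1 <joe1@example.com>, Sixpack, Joe 2 <joe2@example.com> ;" +
  "Sixpack, Joe, 3<joe3@example.com> , joe4@example.com,joe5@example.com;joe6@example.com;";

function matchEntries(list) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const raw = list.match(/(.*?@.*?\..*?)[,;]/g) || [];
  // Drop the trailing delimiter (plus whitespace before it) and trim the rest.
  return raw.map((entry) => entry.replace(/\s*[,;]$/, "").trim());
}

console.log(matchEntries(input));
```

Because the pattern requires a `,` or `;` after every address, a final entry without a trailing delimiter is not matched; appending a delimiter to the input before matching would be one way around that.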

Upvotes: 2
