Reputation: 19
This is similar to some questions that are already out there, but I couldn't find one that answered my question specifically, so thank you for any assistance or insight.
I have a text file of email names and addresses open in TextWrangler (a popular Mac text editor). Sample records:
Timmy Turner <[email protected]>
"[email protected]" <[email protected]>
Susan Alder <[email protected]>,
[email protected]
So some addresses have names preceding them, most are enclosed in <> brackets, some stand alone and are already correct, and some have a trailing comma. I want a global search-and-replace, via Grep or something similar, that produces this end result:
[email protected]
[email protected]
[email protected]
[email protected]
Thanks for any insight!
Upvotes: 1
Views: 3116
Reputation: 4069
TL;DR
Search:
^.*<?\b([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@((?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\b>?.*$
Replace:
\1@\2
Explanation:
According to this article, the RFC 5322 specification gives an official definition for a valid email address.
Their string, simplified for use in TextWrangler, would be:
Search:
([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@((?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Replace:
\1@\2
By itself, it would match:
Timmy Turner <[email protected]>
"[email protected]" <[email protected]>
Susan Alder <[email protected]>,
[email protected]
While that DOES match your example email strings, it doesn't give you the exact result you want, since it also includes "[email protected]", which should be stripped out.
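To see the problem outside the editor, here is a minimal Python sketch. The EMAIL pattern is a much looser stand-in for the RFC 5322 string above (my assumption, not the original regex), and the sample lines are made up in the same four shapes as the question, since the real addresses are not shown.

```python
import re

# Simplified stand-in for the RFC 5322 pattern (assumption: far less strict).
EMAIL = re.compile(
    r"[a-z0-9_+-]+(?:\.[a-z0-9_+-]+)*@(?:[a-z0-9-]+\.)+[a-z0-9-]+",
    re.IGNORECASE,
)

# Hypothetical sample lines mirroring the shapes in the question.
lines = [
    "Timmy Turner <timmy@example.com>",
    '"alias@example.org" <real@example.com>',
    "Susan Alder <susan@example.com>,",
    "plain@example.net",
]

# An unanchored search returns every address on a line,
# including the one that appears inside the quoted display name.
matches_per_line = [EMAIL.findall(line) for line in lines]
for found in matches_per_line:
    print(found)
```

The second line yields two addresses, which is why the bare pattern is not enough on its own.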
You can use some filtering before and after it, if you know a few things:
If yes to 1 and 2, and no to 3, prepend that string with ^.*<?\b and append \b>?.*$ to it.
This starts at the beginning of the line, searches for 0 or more characters, an optional opening bracket, and then a word boundary that starts the actual email address.
Then afterward, look for the word boundary on the last character of the email address, an optional closing bracket, and zero or more characters till the end of the line.
Replacing that with \1@\2 will clean up the entire line to only contain the email address.
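The same prepend/append idea can be tried in Python before running it on the real file. This is only a sketch: LOCAL and DOMAIN are simplified stand-ins for the two capture groups of the long pattern (an assumption), clean is a hypothetical helper name, and the sample lines are invented.

```python
import re

# Simplified stand-ins for the two capture groups (assumption, not RFC 5322).
LOCAL = r"[a-z0-9_+-]+(?:\.[a-z0-9_+-]+)*"
DOMAIN = r"(?:[a-z0-9-]+\.)+[a-z0-9-]+"

# Prefix ^.*<?\b and suffix \b>?.*$ make the match span the whole line,
# so the replacement \1@\2 leaves only the captured address.
WRAPPED = re.compile(rf"^.*<?\b({LOCAL})@({DOMAIN})\b>?.*$", re.IGNORECASE)

def clean(line):
    """Replace the whole line with the last address found on it."""
    return WRAPPED.sub(r"\1@\2", line)

# Hypothetical sample lines mirroring the shapes in the question.
lines = [
    "Timmy Turner <timmy@example.com>",
    '"alias@example.org" <real@example.com>',
    "Susan Alder <susan@example.com>,",
    "plain@example.net",
]
cleaned = [clean(line) for line in lines]
print("\n".join(cleaned))
```

Because the leading .* is greedy, the last address on the line wins, which is what drops the quoted one. The same greediness has a side effect worth checking on real data: with this sketch, a dotted local part such as first.last@example.com comes out as last@example.com, because the \b right after the dot is the latest position where the rest of the pattern can still match.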
Upvotes: 1
Reputation: 3572
sed might work better. You can use a regex to remove the patterns that you don't want:
sed -e "s|.*<||" -e "s|>.*||" your_file.txt > new_file.txt
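For anyone who would rather test the two substitutions without sed, here is a rough Python equivalent. strip_brackets is a hypothetical helper name and the sample lines are invented; count=1 is used to mirror sed's default of replacing only the first match per line.

```python
import re

def strip_brackets(line):
    """Mimic the two sed expressions: drop through the last '<', then from the first '>'."""
    line = re.sub(r".*<", "", line, count=1)  # like s|.*<||
    line = re.sub(r">.*", "", line, count=1)  # like s|>.*||
    return line

# Hypothetical sample lines mirroring the shapes in the question.
lines = [
    "Timmy Turner <timmy@example.com>",
    '"alias@example.org" <real@example.com>',
    "Susan Alder <susan@example.com>,",
    "plain@example.net",
]
stripped = [strip_brackets(line) for line in lines]
print("\n".join(stripped))
```

Lines without brackets pass through unchanged, so a bare address that ends in a comma would keep the comma and need one more substitution.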
Upvotes: 1