Reputation: 13
I would like to validate emails from text files in a directory using bash.
My regex:
grep -Eoh \
"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,8}\b" * \
| sort -u > mail_list
This regex satisfies all my requirements, but it cannot exclude addresses such as:
^%&[email protected]
and
[email protected]
(with two or more dots).
These kinds of addresses should be excluded.
How can I modify this regex to exclude these types of emails?
I can use only one expression for this task.
Upvotes: 0
Views: 144
Reputation: 4550
Try this regex:
'\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'
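I added an alphanumeric group to the front, to force emails to begin with at least one letter or number, after which they may also have symbols. Note that because both groups use +, the part before the @ must be at least two characters long; change the second + to * if one-character names should match too.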
After the @ sign, I added a group which can contain one or more letters, numbers, or hyphens, followed by exactly one period. This group can be repeated, so it still matches multi-level domains such as long.domain.name.com while rejecting two periods in a row.
Finally, the regex ends with the final string as you had it, for example 'com'.
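As a rough illustration of the pattern above, here is a minimal sketch that wraps it in a shell function and feeds it a few lines on standard input. The function name extract_emails and the sample addresses are made up for this example and are not from the question; it also assumes a grep that supports \b and -o, such as GNU grep.

```shell
# Hypothetical helper: print the unique addresses found in the given files
# (or on standard input when no files are given).
extract_emails() {
  grep -Eoh '\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b' "$@" \
    | sort -u
}

# Made-up sample input: the second line has two consecutive dots in the domain.
printf '%s\n' \
  'contact: alice@example.com' \
  'bad: bob@example..com' \
  'sub: carol@mail.example.org' \
  | extract_emails
```

The line with two consecutive dots should produce no match, because every period after the @ has to be preceded by at least one letter, number, or hyphen. In a directory of text files, the call would be extract_emails * > mail_list, mirroring the command in the question.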
Since \b matches a word boundary, and the symbols ^%& are not considered part of the word 'blah', the above will still match [email protected] even though it is preceded by undesired characters. To avoid this, use a negative lookbehind. This requires using grep -P instead of -E:
grep -P '(?<![%&^])\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'
The (?<![%&^]) tells the regex engine to continue matching only if the current position is not immediately preceded by one of the characters %, &, or ^.
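To see the lookbehind in action, the following sketch runs the -P pattern over two made-up lines. It assumes GNU grep built with PCRE support (the -P option is not available in every grep); the function name extract_emails_strict and the sample addresses are hypothetical.

```shell
# Hypothetical helper: same idea as before, but using PCRE (-P) so that the
# negative lookbehind can reject addresses preceded by %, & or ^.
extract_emails_strict() {
  grep -Poh '(?<![%&^])\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b' "$@" \
    | sort -u
}

# Made-up sample input: the second address is prefixed with unwanted symbols.
printf '%s\n' \
  'ok: dave@example.com' \
  'junk: ^%&erin@example.com' \
  | extract_emails_strict
```

The second line should be skipped entirely: the lookbehind fails at the first letter after the &, and \b prevents the match from starting later inside the same word.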
Upvotes: 0
Reputation: 74595
The email address ^%&[email protected] is actually a valid email address.
You can do this in Perl using the Email::Valid module (this assumes that each entry is on a new line):
perl -MEmail::Valid -ne 'print if Email::Valid->address($_)' file1 file2
not email
[email protected]
not email
[email protected]
^%&[email protected]
[email protected]
[email protected]
[email protected]
^%&[email protected]
Upvotes: 1