I Z

Reputation: 5927

What should I use in a bash script to extract email addresses from noisy lines in a file?

I have a file that has one email address per line. Some of them are noisy, i.e. contain junk characters before and/or after the address, e.g.

[email protected]<mailto
<[email protected]>
<[email protected]>Mobile
<[email protected]>
<[email protected]
[email protected]

How can I extract the right address from each line of the file in a loop like this?

for l in `cat file_of_email_addresses`
do
     # do magic here to extract address from $l
done

It looks like garbage before the address always ends with "lt;", and garbage after it always starts with "&amp".

Upvotes: 1

Views: 98

Answers (2)

Cyrus

Reputation: 88583

Try this with GNU grep:

grep -Po '[\w.-]+@[\w.-]+' file

Output:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

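It's not perfect (it accepts any run of word characters, dots, and hyphens on either side of an @, so it does not validate addresses), but perhaps it is sufficient for your task.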
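To tie this back to the loop in the question, here is a minimal sketch that feeds the matches into a while read loop instead of word-splitting the output of cat. The sample addresses and the function name extract_addresses are hypothetical. The -E pattern with POSIX character classes is a swapped-in, roughly equivalent spelling of the -P pattern above, for a grep built without PCRE support.

```shell
# Hypothetical helper: print every address-like token found on stdin.
# [[:alnum:]_] stands in for \w so that -P is not required.
extract_addresses() {
    grep -Eo '[[:alnum:]_.-]+@[[:alnum:]_.-]+'
}

# Made-up sample lines (placeholders, not the addresses from the question).
printf '%s\n' \
    'alice@example.com&lt;mailto' \
    '&lt;bob@example.org&gt;Mobile' \
    'carol@example.net' |
extract_addresses |
while IFS= read -r addr; do
    # Replace this line with whatever should happen per address.
    printf 'address: %s\n' "$addr"
done
```

In a real script the printf that produces the sample lines would be replaced by reading the file, for example extract_addresses < file_of_email_addresses.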

Upvotes: 1

John Bollinger

Reputation: 180113

It would be better to use a tool that's built for pattern matching, such as sed. It would help to first decode the data, as Etan suggested, but if you're willing to assume

  • that the leading segments you want to remove will always end with a ;,
  • that the trailing segments you want to remove will always begin with an &,
  • that the desired addresses will not contain either of those characters, and
  • that every line will contain exactly one @, and that in the address,

then you can do this:

sed 's/^\([^@]*;\)\?\([^&;]*@[^&;]*\).*/\2/' file_of_email_addresses
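As a sketch of how those assumptions play out, the following runs a substitution of the same shape over a few made-up lines. The sample addresses and the function name strip_noise are hypothetical. The command is rewritten with sed -E (extended syntax, so the groups and the ? need no backslashes) because \? in a basic regular expression is a GNU extension.

```shell
# Hypothetical helper: apply the substitution to stdin, one address per line.
# Group 1 (optional) is a leading segment ending in ";";
# group 2 is the address, which stops at the first "&" or ";".
strip_noise() {
    sed -E 's/^([^@]*;)?([^&;]*@[^&;]*).*/\2/'
}

# Made-up sample lines (placeholders, not the addresses from the question).
printf '%s\n' \
    'alice@example.com&lt;mailto' \
    '&lt;bob@example.org&gt;Mobile' \
    'carol@example.net' | strip_noise
# Expected output:
# alice@example.com
# bob@example.org
# carol@example.net
```

A line with no @ at all does not match the pattern and is printed unchanged, so it may be worth filtering such lines out first.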

Upvotes: 0
