rogerwhite

Reputation: 345

awk filter rows with valid email addresses

I am new to bash and awk and have spent days trying to learn them. I think I am very close to a solution, but not quite there, so I am asking for your help. Note that I do not wish to use grep, since I find it to be much slower.

I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format. Plus there is a lot of legacy in here, and a lot of junk and garbled text. I wish to check all of these files for rows with a valid email address and, where one exists, print the row to a file. Note that I am using Cygwin on Windows 10 (not sure if that matters).

Text file:

[email protected],address
#[email protected];address
[email protected];address µÖ
[email protected];username;address
[email protected];username
  [email protected],username;address   [spaces at the start of the row]
 [email protected]|username|address   [tabs at the start of the row]

Code:

awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (NF>1 && tolower($1) ~ /[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+/)
    {
        r=gensub("[,|;: \t]+",":",1,$0)
        print r > "file_good"
    }
    else
        print $0 > "file_ignore"
}' *.txt

Expected output into: file_good

[email protected]:username;address
[email protected]:username
[email protected]:username;address
[email protected]:username|address

Issue with the code:

  1. I can't find a way to filter out non-ASCII (non-printable) characters.
  2. For some reason the code allows rows without a valid email address through. For example: [email protected] ; #[email protected] ; etc.

Any help would be much appreciated!

Upvotes: 0

Views: 565

Answers (2)

user13586221

Reputation:

Whilst there are other complexities relating to the stated goal, the main reason why your original awk program did not work as expected is that the regex lacked anchoring:

tolower($1) ~ /^[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+$/

$1 ~ /.../ is changed to $1 ~ /^...$/. Also, the r=gensub(...) call only replaces the first delimiter run, and its result is printed on the very next line; since gensub is specific to GNU awk, a plain sub (which modifies $0 in place) would be all that's needed here.
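Putting it together, a minimal sketch of the corrected loop might look like the following. The [^ -~] test (reject any byte outside printable ASCII) is my guess at how to handle point 1 of the question, not something the original code attempted:

awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")   # trim leading/trailing whitespace (this re-splits the fields)
    if ($0 ~ /[^ -~]/)            # assumption: any byte outside printable ASCII marks a junk row
        print > "file_ignore"
    else if (NF > 1 && tolower($1) ~ /^[0-9a-z_.+-]+@[0-9a-z_.-]+\.[a-z0-9]+$/)
    {
        sub(/[,|;: \t]+/, ":")    # replace only the first delimiter run with ":"
        print > "file_good"
    }
    else
        print > "file_ignore"
}' *.txt

The anchored regex now rejects rows like #[email protected], since # cannot appear anywhere in the first field.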

Upvotes: 1

petrus4

Reputation: 614

This isn't a complete solution, but I can think of a few preliminary steps which will probably make the rest of the process much simpler.

tr ';' '\n' < textfile | tr ',' '\n' | tr '|' '\n' > textfile2
mv textfile2 textfile
sed -n '/@/p' textfile > emails
sed -i '/@/d' textfile

What that will do is try to turn all of those delimiters into newlines, which will have the effect of putting the delimited fields on separate lines. After that, a brute-force search for all lines containing a '@' symbol will hopefully give you at least a few email addresses, which you can then dump out to a separate file and delete from the original. From there, you can probably build a similar heuristic for pulling out the usernames and snail-mail addresses, if you can find a common anchor.
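Since you'd rather avoid grep, the same split-then-separate idea also fits in one pass with tr and awk. This is only a sketch consolidating the commands above; emails and textfile2 are placeholder names:

tr ';,|' '\n\n\n' < textfile | awk '/@/ { print > "emails"; next } { print > "textfile2" }'
mv textfile2 textfile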

In my experience, regular expressions can induce literal migraines. Wherever possible, I try to use the simplest solution I can. As mentioned, this most likely isn't perfect, but it's a start.

Upvotes: 0
