rogerwhite

Reputation: 345

awk filter rows with valid email addresses

I am new to bash and awk and have spent days trying to learn them. I think I am very close to a solution, but not quite there, so I am asking for your help. Note that I do not wish to use grep, since I find it to be much slower.

I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format. Plus there is a lot of legacy in here, and a lot of junk and garbled text. I wish to check all of these files for rows with a valid email address and, where one exists, print the row to a file. Note that I am using Cygwin on Windows 10 (not sure if that matters).

Text file:

[email protected],address
#[email protected];address
[email protected];address µÖ
[email protected];username;address
[email protected];username
  [email protected],username;address   [spaces at the start of the row]
 [email protected]|username|address   [tabs at the start of the row]

Code:

awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (NF>1 && tolower($1) ~ /[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+/)
    {
        r=gensub("[,|;: \t]+",":",1,$0)
        print r > "file_good"
    }
    else
        print $0 > "file_ignore"
}' *.txt

Expected output into: file_good

[email protected]:username;address
[email protected]:username
[email protected]:username;address
[email protected]:username|address

Issue with the code:

  1. I can't find a way to filter out non-ASCII (non-printable) characters.
  2. For some reason the code allows rows without a valid email address through. For example: [email protected] ; #[email protected] ; etc.

Any help would be much appreciated!

Upvotes: 0

Views: 565

Answers (2)

user13586221

Reputation:

Whilst there are other complexities relating to the stated goal, the main reason why your original awk program did not work as expected is that the regex lacked anchoring:

tolower($1) ~ /^[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+$/

$1 ~ /.../ is changed to $1 ~ /^...$/. Also, the r=gensub(...) call only replaces the first delimiter run, and its result is printed on the very next line; since gensub is specific to GNU awk, a plain sub (which modifies $0 in place) would be all that's needed here.
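Putting it together, a minimal sketch of the corrected loop might look like the following. The [^ -~] test (reject any byte outside printable ASCII) is my guess at how to handle point 1 of the question, not something the original code attempted:

awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")   # trim leading/trailing whitespace (this re-splits the fields)
    if ($0 ~ /[^ -~]/)            # assumption: any byte outside printable ASCII marks a junk row
        print > "file_ignore"
    else if (NF > 1 && tolower($1) ~ /^[0-9a-z_.+-]+@[0-9a-z_.-]+\.[a-z0-9]+$/)
    {
        sub(/[,|;: \t]+/, ":")    # replace only the first delimiter run with ":"
        print > "file_good"
    }
    else
        print > "file_ignore"
}' *.txt

The anchored regex now rejects rows like #[email protected], since # cannot appear anywhere in the first field.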

Upvotes: 1

petrus4

Reputation: 614

This isn't a complete solution, but I can think of a few preliminary steps which will probably make the rest of the process much simpler.

tr ';' '\n' < textfile | tr ',' '\n' | tr '|' '\n' > textfile2
mv textfile2 textfile
sed -n '/@/p' textfile > emails
sed -i '/@/d' textfile

What that will do is try to turn all of those delimiters into newlines, which will have the effect of putting the delimited fields on separate lines. After that, a brute-force search for all lines containing a '@' symbol will hopefully give you at least a few email addresses, which you can then dump out to a separate file and delete from the original. From there, you can probably build a similar heuristic for pulling out the usernames and snail-mail addresses, if you can find a common anchor.
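Since you'd rather avoid grep, the same split-then-separate idea also fits in one pass with tr and awk. This is only a sketch consolidating the commands above; emails and textfile2 are placeholder names:

tr ';,|' '\n\n\n' < textfile | awk '/@/ { print > "emails"; next } { print > "textfile2" }'
mv textfile2 textfile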

In my experience, regular expressions can induce literal migraines. Wherever possible, I try to use the simplest solution I can. As mentioned, this most likely isn't perfect, but it's a start.

Upvotes: 0
