Reputation: 345
I am new to bash and awk, and I have spent days trying to learn them. I think I am very close to the solution but not completely there, so I would appreciate your help. Do note, I do not wish to use grep, since I find it to be much slower.
I have a huge number of text files, each several hundred MB in size. Unfortunately, they are not all standardized in any one format. Plus there is a lot of legacy in here, and a lot of junk and garbled text. I wish to check all of these files for rows containing a valid email ID, and print any such rows to a file. Do note I am using Cygwin on Windows 10 (not sure if that matters).
Text file:
[email protected],address
#[email protected];address
[email protected];address µÖ
[email protected];username;address
[email protected];username
[email protected],username;address [spaces at the start of the row]
[email protected]|username|address [tabs at the start of the row]
Code:
awk -F'[,|;: \t]+' '{
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (NF>1 && tolower($1) ~ /[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+/)
    {
        r=gensub("[,|;: \t]+",":",1,$0)
        print r > "file_good"
    }
    else
        print $0 > "file_ignore"
}' *.txt
Expected output into: file_good
[email protected]:username;address
[email protected]:username
[email protected]:username;address
[email protected]:username|address
Issue with the code: rows that do not actually contain a valid email ID, such as the commented-out line, still end up in file_good instead of file_ignore.
Any help would be much appreciated!
Upvotes: 0
Views: 565
Reputation:
Whilst there are other complexities relating to the stated goal, the main reason why your original awk program did not work as expected is that the regex lacked anchoring:
tolower($1) ~ /^[0-9a-z_\-\.\+]+@[0-9a-z_\-\.]+\.[a-z0-9]+$/
That is, $1 ~ /.../ is changed to $1 ~ /^...$/, so that the whole first field must be an email address rather than merely contain one. Also, gensub is specific to GNU awk, and since the program only replaces the first occurrence, it could be that all that's needed in this case is sub.
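Putting those changes together, a minimal sketch of the revised program might look like the following (the delimiters and output file names are taken from the question; the bracket expressions are rewritten without backslash escapes, since ., + and - are literal inside [...] when - is placed last):
awk -F'[,|;: \t]+' '{
    # Trim leading/trailing whitespace; modifying $0 via gsub also
    # re-splits the fields, so rows starting with spaces/tabs get a non-empty $1.
    gsub(/^[ \t]+|[ \t]+$/, "")
    if (NF > 1 && tolower($1) ~ /^[0-9a-z_.+-]+@[0-9a-z_.-]+\.[a-z0-9]+$/) {
        # Replace only the first delimiter run with ":", as gensub(..., 1, $0) did.
        sub(/[,|;: \t]+/, ":")
        print > "file_good"
    } else {
        print > "file_ignore"
    }
}' *.txt
Unlike gensub, sub modifies $0 in place, so the temporary variable r is no longer needed.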
Upvotes: 1
Reputation: 614
This isn't a complete solution, but I can think of a few preliminary steps which will probably make the rest of the process much simpler.
tr ';,|' '\n' < textfile > textfile2
mv textfile2 textfile
sed -n '/@/p' textfile > emails
sed -i '/@/d' textfile
What that will do is turn all of those delimiters into newlines, which has the effect of putting the delimited fields on separate lines. After that, a brute-force search for all lines containing a '@' symbol will hopefully give you at least a few email addresses, which you can then dump out to a separate file and delete from the original. From there, you can probably build a similar heuristic for pulling out the usernames and snail-mail addresses, if you can find a common anchor.
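Since the question mentions a large number of .txt files, the same idea could be wrapped in a loop. This is only a sketch under a few assumptions: the .split suffix and the emails output file are illustrative names I've chosen, and sed -i assumes GNU sed (which Cygwin provides):
for f in *.txt; do
    tr ';,|' '\n' < "$f" > "$f.split"    # one delimited field per line
    sed -n '/@/p' "$f.split" >> emails   # collect every line containing '@'
    sed -i '/@/d' "$f.split"             # then drop those lines from the split copy
done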
In my experience, regular expressions can induce literal migraines, so wherever possible I try to use the simplest solution I can. As mentioned, this most likely isn't perfect, but it's a start.
Upvotes: 0