Substring pattern matching in two files

Question

I have an input flat file like this with many rows:

Apr  3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n5ut5s 1 0 Message-Type=Authen OK,User-Name=joe7@it.test.com,NAS-IP-Address=4.196.63.55,Caller-ID=az-4d-31-89-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr  3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n6ut5s 1 0 Message-Type=Authen OK,User-Name=bobe@jg.test.com,NAS-IP-Address=4.197.43.55,Caller-ID=az-4d-4q-x8-92-80,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr  3 13:30:02 abg8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=jerry777@it.test.com,NAS-IP-Address=7.196.63.55,Caller-ID=az-4d-n6-4e-y2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr  3 13:30:02 aca8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc777o.@it.test.com,NAS-IP-Address=4.196.263.55,Caller-ID=a4-4e-31-99-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr  3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc77@xed.test.com,NAS-IP-Address=4.136.163.55,Caller-ID=az-4d-4w-b5-s2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,

I'm trying to grep the email addresses from the input file to see whether they already exist in the master file.

The master flat file looks like this:

a44e31999290;frc777o.@it.test.com;20150403
az4d4qx89280;bobe@jg.test.com;20150403
0dbgd0fed04t;rrfuf@us.test.com;20150403
28cbe9191d53;rttuu4en@us.test.com;20150403
az4d4wb5s290;frc77@xed.test.com;20150403
d89695174805;ccis6n@cn.test.com;20150403

I want a simple count of the emails from the input file that don't exist in master.

So using the examples I hope to see: count=3, because bobe@jg.test.com and frc77@xed.test.com already exist in master but the others don't.

I tried various combinations of grep; the example below is from my last tests, but it is not working. I'm calling grep from within a Perl script to first capture the emails and then count them, but all I really need is the count of emails from the input file that don't exist in master.

grep -o -P '(?<=User-Name=\).*(?=,NAS-IP-)' $infile $mstr > $new_emails;

Any help would be appreciated, Thanks.

fedorqui · Accepted Answer

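I would use this approach in awk:

$ awk 'FNR==NR {a[$2]; next}       # master: store every email (2nd field)
       !($4 in a) {c++}            # file: count emails that are not in master
       END {print c+0}' FS=';' master FS='[,=]' file
2

This works by setting a different field separator for each file, storing the emails from master and checking the emails from file against them, then printing the final count. The separators are assigned on the command line, right before each file name, because FS set inside a rule only applies from the next record on. With the sample data the result is 2 rather than 3, because frc777o.@it.test.com is also present in master.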
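
To try this on the sample data, the following sketch recreates shortened copies of the two files in a scratch directory and assigns each separator on the command line right before its file, so that every record of that file is split with it. The scratch directory, the trimmed host names and the variable names missing and present are assumptions made for illustration. It counts both the emails that are absent from master and the ones already there.

```shell
#!/bin/sh
# Scratch directory and file names are placeholders for this sketch.
dir=$(mktemp -d)

# Master rows: id;email;date
cat > "$dir/master" <<'EOF'
a44e31999290;frc777o.@it.test.com;20150403
az4d4qx89280;bobe@jg.test.com;20150403
0dbgd0fed04t;rrfuf@us.test.com;20150403
az4d4wb5s290;frc77@xed.test.com;20150403
EOF

# Log rows, shortened but keeping the "Message-Type=...,User-Name=...," layout
cat > "$dir/file" <<'EOF'
Apr  3 13:30:02 aag8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=joe7@it.test.com,NAS-IP-Address=4.196.63.55,
Apr  3 13:30:02 aag8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=bobe@jg.test.com,NAS-IP-Address=4.197.43.55,
Apr  3 13:30:02 abg8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=jerry777@it.test.com,NAS-IP-Address=7.196.63.55,
Apr  3 13:30:02 aca8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=frc777o.@it.test.com,NAS-IP-Address=4.196.263.55,
Apr  3 13:30:02 aag8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=frc77@xed.test.com,NAS-IP-Address=4.136.163.55,
EOF

# FS=';' applies to every record of master, FS='[,=]' to every record of file.
# Count the emails from file that are NOT in master.
missing=$(awk 'FNR==NR {a[$2]; next}
               !($4 in a) {n++}
               END {print n+0}' FS=';' "$dir/master" FS='[,=]' "$dir/file")

# Same structure, counting the emails that ARE already in master.
present=$(awk 'FNR==NR {a[$2]; next}
               ($4 in a) {n++}
               END {print n+0}' FS=';' "$dir/master" FS='[,=]' "$dir/file")

echo "not in master: $missing"
echo "already in master: $present"

rm -r "$dir"
```

With these rows it should report 2 emails missing from master (joe7 and jerry777) and 3 already present.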

For the master file we use ; as the separator and take the 2nd field:

$ awk -F";" '{print $2}' master
frc777o.@it.test.com
bobe@jg.test.com
rrfuf@us.test.com
rttuu4en@us.test.com
frc77@xed.test.com
ccis6n@cn.test.com

For the input file, named file here (the one with all the info), we use either , or = as the separator and take the 4th field:

$ awk -F'[,=]' '{print $4}' file
joe7@it.test.com
bobe@jg.test.com
jerry777@it.test.com
frc777o.@it.test.com
frc77@xed.test.com
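
Since both files reduce to plain lists of emails, the comparison can also be sketched with cut and grep, which has the side benefit of listing the missing addresses and not only counting them. This is a minimal sketch, not part of the approach above: the scratch directory, the shortened sample rows and the names known, seen, new and count are assumptions made for illustration.

```shell
#!/bin/sh
# Scratch directory and shortened sample rows are placeholders for this sketch.
dir=$(mktemp -d)

cat > "$dir/master" <<'EOF'
az4d4qx89280;bobe@jg.test.com;20150403
0dbgd0fed04t;rrfuf@us.test.com;20150403
az4d4wb5s290;frc77@xed.test.com;20150403
EOF

cat > "$dir/file" <<'EOF'
Apr  3 13:30:02 aag8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=joe7@it.test.com,NAS-IP-Address=4.196.63.55,
Apr  3 13:30:02 aag8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=bobe@jg.test.com,NAS-IP-Address=4.197.43.55,
Apr  3 13:30:02 abg8 CisACS_01_PassedAuth p1 1 0 Message-Type=Authen OK,User-Name=jerry777@it.test.com,NAS-IP-Address=7.196.63.55,
EOF

# Emails known to master, one per line (2nd ;-separated field).
cut -d';' -f2 "$dir/master" > "$dir/known"

# Emails seen in the log: 4th field when splitting on , or =
awk -F'[,=]' '{print $4}' "$dir/file" > "$dir/seen"

# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from a file.
# What remains are the emails from the log that master does not contain.
new=$(grep -vxFf "$dir/known" "$dir/seen")

# Count the non-empty lines of that list.
count=$(printf '%s\n' "$new" | grep -c .)

printf '%s\n' "$new"
echo "count=$count"

rm -r "$dir"
```

The -x and -F options matter here: without them, an address such as bobe@jg.test.com could match as a regular expression or as a substring of a longer address.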
