Reputation: 3
I have been trying to rewrite an egrep command in awk to improve performance, but haven't been successful. The egrep command performs a simple case-insensitive search for the records of file1 in file2, allowing partial matches. Below are the command and sample output.
file1 contains:
Abc
xyz
123
blah
hh
a,b
file2 contains:
abc de
xyz
123
456
blah
test1
abdc
abc,def,123
kite
a,b,c
Original command :
egrep -i -f file1 file2
Original (egrep) command output :
$ egrep -i -f file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c
I would like to rewrite the command in awk so that it does the same operation. I have tried the command below, but it performs a full-record match rather than a partial match like grep does.
Modified command in awk :
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
Modified command (awk) output:
$ awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
xyz
123
blah
This excludes the records which had partial matches for the string "abc". Any help to fix the awk command please? Thanks in advance.
Upvotes: 0
Views: 76
Reputation: 29137
I would be a bit surprised if this were significantly faster than egrep, but you can try it:
$ awk 'NR==FNR {r=r ((r=="")?"":"|") tolower($0);next} tolower($0)~r' file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c
Explanation: first build the regular expression r1|r2|...|rn from the content of file1 and store it in the awk variable r. Then print all lines of file2 that match it, using the ~ match operator.
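To see what the regular expression actually looks like before matching anything, it can help to print it. The following is a minimal sketch that recreates the question's file1 under the assumed name file1.sample (so it does not overwrite a real file1) and writes the result to built_regex.txt, a file name chosen only for this illustration:

```shell
# Recreate the question's file1 under a hypothetical name
printf '%s\n' 'Abc' 'xyz' '123' 'blah' 'hh' 'a,b' > file1.sample

# Build the alternation the same way as the one-liner above,
# but print it at the end instead of matching it against file2
awk '{ r = r ((r == "") ? "" : "|") tolower($0) } END { print r }' file1.sample > built_regex.txt
cat built_regex.txt
```

With the sample data this should print abc|xyz|123|blah|hh|a,b, which is the pattern each lowercased line of file2 is then tested against.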
If you have GNU awk, you can use its IGNORECASE variable instead of tolower:
$ awk -v IGNORECASE=1 'NR==FNR{r=r ((r=="")?"":"|") $0;next} $0~r' file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c
With GNU awk, forcing the type of r to regexp instead of string may also lead to better performance. The manual says:
Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is "regexp constants," for several reasons:
...
It is more efficient to use regexp constants. 'awk' can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, 'awk' must first convert the string into this internal form and then perform the pattern matching.
In order to do this you can try:
$ awk -v IGNORECASE=1 'NR==FNR {s=s ((s=="")?"":"|") $0;next}
FNR==1 && NR!=FNR {r=@//;sub(//,s,r);print typeof(r),r} $0~r' file1 file2
regexp Abc|xyz|123|blah|hh|a,b
abc de
xyz
123
blah
abc,def,123
a,b,c
(r=@// forces variable r to be of type regexp, and sub(//,s,r) does not change this.)
Note: just like with your egrep attempts, the lines of file1 are treated as regular expressions, not as plain text strings to search for. So, if one line in file1 is ".*", all lines in file2 will match, not just the lines containing the substring ".*".
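A small demonstration of that pitfall, as a sketch only: the file names patterns.sample, data.sample and regex_matches.txt, and their contents, are made up for illustration and are not part of the question.

```shell
# Hypothetical sample files: one pattern containing a regex metacharacter
printf '%s\n' 'a.c' > patterns.sample
printf '%s\n' 'abc' 'a.c' 'xyz' > data.sample

# Same regex-building approach as above: "." matches any character,
# so the pattern a.c selects "abc" as well as the literal "a.c"
awk 'NR==FNR {r = r ((r=="") ? "" : "|") tolower($0); next} tolower($0) ~ r' \
    patterns.sample data.sample > regex_matches.txt
cat regex_matches.txt
```

This should print both abc and a.c, even though only the second line contains the text a.c literally. If the lines of file1 are meant as plain strings, a literal substring test is the safer choice.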
Upvotes: 0
Reputation: 50760
Use index like this for a partial literal match:
awk '
NR == FNR {
    needles[tolower($0)]
    next
}
{
    haystack = tolower($0)
    for (needle in needles) {
        if (index(haystack, needle)) {
            print
            break
        }
    }
}' file1 file2
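As a quick check, here is a sketch that runs the same logic on the sample data from the question. The file names file1.sample, file2.sample and index_matches.txt are assumptions made for this example, so that existing files are not overwritten:

```shell
# Sample data copied from the question, stored under hypothetical names
printf '%s\n' 'Abc' 'xyz' '123' 'blah' 'hh' 'a,b' > file1.sample
printf '%s\n' 'abc de' 'xyz' '123' '456' 'blah' 'test1' 'abdc' \
    'abc,def,123' 'kite' 'a,b,c' > file2.sample

# Lowercase every needle from the first file, then print each line of the
# second file that contains at least one needle as a literal substring
awk '
NR == FNR { needles[tolower($0)]; next }
{
    haystack = tolower($0)
    for (needle in needles)
        if (index(haystack, needle)) { print; break }
}' file1.sample file2.sample > index_matches.txt
cat index_matches.txt
```

On this data it should print the same six lines as the original egrep command (abc de, xyz, 123, blah, abc,def,123 and a,b,c), because none of the sample patterns contains a regex metacharacter. The results would differ only for patterns such as ".*", which index treats as plain text.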
Upvotes: 0