Ra1nm4n

Reputation: 3

Partial string search between two files using AWK

I have been trying to rewrite an egrep command using awk to improve performance, but I haven't been successful. The egrep command performs a simple case-insensitive search for the records of file1 in file2, where a partial match within a line of file2 counts as a match. Below are the command and sample output.

file1 contains:

Abc
xyz
123
blah
hh
a,b

file2 contains:

abc de
xyz
123
456
blah
test1
abdc
abc,def,123
kite
a,b,c

Original command: egrep -i -f file1 file2

Original (egrep) command output:

$ egrep -i -f file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c

I would like to rewrite the command in awk so that it performs the same operation. I tried the command below, but it performs a full-record match rather than a partial match like grep does.

Modified command in awk: awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2

Modified command (awk) output:

$ awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
xyz
123
blah

This excludes the records that match only partially, such as those containing the string "abc". How can I fix the awk command? Thanks in advance.

Upvotes: 0

Views: 76

Answers (2)

Renaud Pacalet

Reputation: 29137

I would be a bit surprised if it were significantly faster than egrep, but you can try this:

$ awk 'NR==FNR {r=r ((r=="")?"":"|") tolower($0);next} tolower($0)~r' file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c

Explanation: first build the r1|r2|...|rn regular expression from the content of file1 and store it in awk variable r. Then print all lines of file2 that match it, thanks to the ~ match operator.
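To see the pattern that this accumulation builds, one can print r instead of matching against it. The following is only a minimal sketch: it assumes the question's file1 is recreated in a scratch directory, and the shell variable names tmp and regex are made up for illustration.

```shell
# Recreate file1 from the question in a scratch directory (illustrative setup).
tmp=$(mktemp -d)
printf '%s\n' Abc xyz 123 blah hh a,b > "$tmp/file1"

# Same accumulation as in the answer, but print the pattern at the end
# instead of using it to filter file2.
regex=$(awk '{r = r ((r=="") ? "" : "|") tolower($0)} END {print r}' "$tmp/file1")
echo "$regex"   # abc|xyz|123|blah|hh|a,b

rm -r "$tmp"
```

Each line of file1 becomes one alternative of the pattern, so a line of file2 is printed as soon as any alternative matches anywhere in it.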

If you have GNU awk you can use its IGNORECASE variable instead of tolower:

$ awk -v IGNORECASE=1 'NR==FNR{r=r ((r=="")?"":"|") $0;next} $0~r' file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c

With GNU awk, forcing r to be of type regexp instead of string may also improve performance. The manual says:

Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is "regexp constants," for several reasons:
...
It is more efficient to use regexp constants. 'awk' can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, 'awk' must first convert the string into this internal form and then perform the pattern matching.

To do this, you can try:

$ awk -v IGNORECASE=1 'NR==FNR {s=s ((s=="")?"":"|") $0;next}
    FNR==1 && NR!=FNR {r=@//;sub(//,s,r);print typeof(r),r} $0~r' file1 file2
regexp Abc|xyz|123|blah|hh|a,b
abc de
xyz
123
blah
abc,def,123
a,b,c

(r=@// forces variable r to be of type regexp and sub(//,s,r) does not change this)

Note: just as with your egrep command, the lines of file1 are treated as regular expressions, not as literal strings to search for. So, if one line in file1 is .*, all lines in file2 will match, not just the lines containing the substring .*.
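The difference can be checked with a small sketch. The file names pats and data and the shell variables are hypothetical, chosen only for this illustration; the literal variant swaps the ~ operator for awk's index() function, which searches for a plain substring.

```shell
# Scratch files: one pattern line containing regex metacharacters, three data lines.
tmp=$(mktemp -d)
printf '%s\n' '.*' > "$tmp/pats"
printf '%s\n' 'abc de' 'x.*y' 'kite' > "$tmp/data"

# Regex matching: ".*" is interpreted as a pattern, so every data line matches.
regex_hits=$(awk 'NR==FNR {r = r ((r=="") ? "" : "|") tolower($0); next}
                  tolower($0) ~ r' "$tmp/pats" "$tmp/data")

# Literal matching: only the line that contains the two characters ".*" matches.
literal_hits=$(awk 'NR==FNR {n[tolower($0)]; next}
                    {h = tolower($0); for (k in n) if (index(h, k)) {print; break}}' \
                    "$tmp/pats" "$tmp/data")

printf 'regex:\n%s\nliteral:\n%s\n' "$regex_hits" "$literal_hits"
rm -r "$tmp"
```

With this data the regex variant should report all three lines, while the literal variant should report only x.*y.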

Upvotes: 0

oguz ismail

Reputation: 50760

Use index like this for a partial literal match:

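awk '
# First file (file1): store each lowercased line as a search string.
NR == FNR {
  needles[tolower($0)]
  next
}
# Second file (file2): print the line if it contains any search string.
{
  haystack = tolower($0)
  for (needle in needles) {
    if (index(haystack, needle)) {  # index() does a literal substring search
      print
      break  # one match is enough, so stop checking this line
    }
  }
}' file1 file2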
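As a quick check, the same index-based logic can be run against the sample data from the question. This is only a sketch: the scratch directory and the shell variables tmp and out are assumptions added for illustration.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' Abc xyz 123 blah hh a,b > "$tmp/file1"
printf '%s\n' 'abc de' xyz 123 456 blah test1 abdc abc,def,123 kite a,b,c > "$tmp/file2"

# Case-insensitive literal substring search of every file1 line in file2.
out=$(awk '
NR == FNR { needles[tolower($0)]; next }
{
  haystack = tolower($0)
  for (needle in needles)
    if (index(haystack, needle)) { print; break }
}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -r "$tmp"
```

On this data it should print the same six lines as the egrep command in the question, in file2 order, because none of the sample search strings contains a regular expression metacharacter.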

Upvotes: 0
