Reputation: 1003
I have a text file, and I need to identify a certain pattern in one field. I am using AWK, and trying to use the match() function.
The requirement is to check whether a string of digits matches one of the following patterns:
??????1?
??????3?
??????5?
??????7?
i.e. I am only interested in the last-but-one digit being a 1, 3, 5, or 7.
I have a solution, which looks like this:
b = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]1[0-9]")
c = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]3[0-9]")
d = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]5[0-9]")
e = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]7[0-9]")
if (b || c || d || e)
{
print "Found a match" $23
}
I think, though, that I should be able to write the regex more succinctly, like this:
b = match($23, "[0-9]{6}1[0-9]")
but this does not work.
Am I missing something, or are my regex skills (which are not great), really all that bad?
Thanks in anticipation
Upvotes: 0
Views: 841
Reputation: 41446
Here is one awk solution:
awk -v FS="" '$7~/(1|3|5|7)/' file
By setting FS to nothing, every character becomes a field. We can then test field #7.
Or, as Tom posted, with a bracket expression:
awk -v FS="" '$7~/[1357]/' file
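One caveat worth spelling out: with an empty FS, $7 is the seventh character of the whole line rather than of field 23, and POSIX leaves the behaviour of an empty FS unspecified (gawk splits per character). If the digits sit in a whitespace-separated field, as in the question, a rough alternative is to pick the character out of that field with substr(). The sketch below is only an illustration: the sample data is made up, and it uses field 2 instead of field 23 to keep the input short.

```shell
# Hypothetical sample: the digit string sits in field 2 here
# (the question uses field 23; only the field number changes).
sample='a 12345678
b 12345628
c 99999930
d 1234'

# substr(s, 7, 1) is the 7th character of the field, i.e. the
# last-but-one digit of an 8-digit string. A field shorter than
# 7 characters yields an empty string, which does not match.
odd_hits=$(printf '%s\n' "$sample" | awk 'substr($2, 7, 1) ~ /[1357]/ { print $1 }')

printf '%s\n' "$odd_hits"
```

This should print the labels a and c, whose seventh digits are 7 and 3. It needs neither RE intervals nor an empty FS, so it should behave the same in any awk.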
Upvotes: 0
Reputation: 203149
The regex delimiter is /.../, not "...". When you use quotes in an RE context, you're telling awk that there's an RE stored inside a string literal. That string literal gets parsed twice, once when the script is read and again when it's executed, which makes your RE specification that much more complicated because it has to accommodate the double parsing.
So, do not write:
b = match($23, "[0-9]{6}1[0-9]")
write:
b = match($23, /[0-9]{6}1[0-9]/)
instead.
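To make the double parsing concrete, here is a small sketch that looks for a literal dot. The input string a.b is made up for the illustration; with a regex literal one backslash is enough, while the string form needs the backslash doubled because the string parser consumes one level of escaping before the regex engine sees the pattern.

```shell
# Both commands look for a literal dot in the input "a.b".
# Regex literal: a single backslash escapes the dot.
literal=$(echo 'a.b' | awk '{ print match($0, /\./) }')

# Dynamic regex held in a string: the string is parsed first and turns
# "\\." into \. , which is then parsed again as a regex.
dynamic=$(echo 'a.b' | awk '{ print match($0, "\\.") }')

echo "$literal $dynamic"
```

Both calls report position 2, but only the string form needed the extra backslash. For the pattern in the question there are no backslashes, so the quoted version happens to behave the same as the /.../ version.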
That's not your problem though. The most likely problem is that you are calling a version of awk that does not support RE intervals like {6}. If you are using an older version of GNU awk, you can enable that functionality by adding the --re-interval flag:
awk --re-interval '...b = match($23, /[0-9]{6}1[0-9]/)...'
but whether it's that or you're using an awk that just doesn't support RE intervals, the best thing to do is get a newer version of gawk.
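If you are not sure which case applies, a quick probe along these lines may help. It is a sketch only, and it tests whichever awk comes first on your PATH: with interval support a{2}b means two a's followed by b, so it matches aab; without it the braces are taken literally and aab does not match.

```shell
# Probe the default awk for RE interval support.
# Supported:     /a{2}b/ matches "aab".
# Not supported: the pattern means the literal text a{2}b, so no match.
support=$(echo 'aab' | awk '{ if ($0 ~ /a{2}b/) print "supported"; else print "not supported" }')

echo "RE intervals: $support"
```

If this reports "not supported", either try the --re-interval flag (older gawk) or fall back to repeating [0-9] six times as in the original script.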
Finally, your whole script can be reduced to:
awk --re-interval '$23 ~ /[0-9]{6}[1357][0-9]/{print "Found a match", $23}'
Change [0-9] to [[:digit:]] for locale-independence if you like.
The reason RE intervals weren't supported by default in gawk until recently is that old awk didn't support them, so a script with an RE of a{2}b would, when executed in old awk, have been looking for literally those 5 chars, and gawk didn't want old scripts to quietly break when executed in gawk instead of old awk. A few releases back the gawk guys rightly decided to take the plunge and enable RE intervals by default, favouring our convenience over backward compatibility.
Upvotes: 3