Reputation: 61
I am using grep to search through text files containing 88-character-long MRZs (machine readable zones). Within the text file they are preceded by a semicolon. I only want to get the substring of characters 3-5 from the string.
This is my pattern:
egrep --include *.txt -or . -e ";[A-Z][A-Z0-9<][A-Z<]{3}"
This is a textfile:
text is here;P<RUSIVAN<<DEL<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<F64D123456RUS7404124F131009734P41234<<<<<<<8 ;2019-02-08
This is my output:
;P<RUS
This is my desired output:
RUS
The semicolon introduces the MRZ. It starts with an uppercase letter, followed by either an uppercase letter, a digit or a filler character <. Then follows the 3-character country code, which can contain uppercase letters or filler characters <.
This pattern works fine, but all I want returned is the last 3 characters, the ones matched by the {3} quantifier. Is there a way to get only the last 3 characters of a matching pattern?
In the sample text file the desired output would be RUS.
Thank you!
Upvotes: 1
Views: 1229
Reputation: 204721
Is this all you're trying to do?
$ awk -F';' '{print substr($2,3,3)}' file
RUS
$ sed -E 's/[^;]*;..(.{3}).*/\1/' file
RUS
If not then edit your question to provide more truly representative sample input/output.
The UNIX command to find files is named find, btw, not grep. I know the GNU guys added a bunch of options for finding files to grep but just don't use them as they make your grep command unnecessarily complicated (and inconsistent with the other UNIX text processing tools) as it then needs arguments to find files as well as to g/re/p within the files. So your command line if you're using grep should be:
find . -name '*.txt' -exec grep 'stuff' {} +
not:
egrep --include *.txt -or . -e 'stuff'
and do the same for any other tool:
find . -name '*.txt' -exec grep 'stuff' {} +
find . -name '*.txt' -exec sed 'stuff' {} +
find . -name '*.txt' -exec awk 'stuff' {} +
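To make the find-plus-tool approach concrete, here is a minimal sketch that combines find with the sed substitution, tightened to the character classes from the question. The temporary directory and the file name sample.txt are made up for illustration, and it assumes a sed that supports -E (GNU and BSD sed both do). It only reports the first MRZ on each line.

```shell
# Create a throwaway directory holding one sample file (hypothetical name).
dir=$(mktemp -d)
printf '%s\n' 'text is here;P<RUSIVAN<<DEL<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<F64D123456RUS7404124F131009734P41234<<<<<<<8 ;2019-02-08' > "$dir/sample.txt"

# find picks the files; sed -n with the p flag prints only lines where the
# substitution succeeded, i.e. lines that actually contain an MRZ start.
out=$(find "$dir" -name '*.txt' -exec sed -nE 's/^[^;]*;[A-Z][A-Z0-9<]([A-Z<]{3}).*/\1/p' {} +)
echo "$out"

# Clean up the temporary directory.
rm -rf "$dir"
```

Unlike the plain substr version, lines without a matching MRZ produce no output at all.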
Upvotes: 0
Reputation: 163632
If you can use GNU grep, you can make use of \K (available with -P, i.e. Perl-compatible regex mode), which drops everything matched so far from the reported match, and then match your character class 3 times:
grep -roP --include=*.txt ";[A-Z][A-Z0-9<]\K[A-Z<]{3}"
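As a quick check, the same pattern can be tried on the sample line from the question by piping it through grep. This is only a sketch and assumes a GNU grep built with PCRE support (-P); the variable names line and out are just for illustration. Single quotes are used so the shell leaves the backslash alone.

```shell
# Sample line taken from the question.
line='text is here;P<RUSIVAN<<DEL<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<F64D123456RUS7404124F131009734P41234<<<<<<<8 ;2019-02-08'

# -o prints only the matched part; \K resets the start of the match, so the
# semicolon and the first two MRZ characters are checked but not printed.
out=$(printf '%s\n' "$line" | grep -oP ';[A-Z][A-Z0-9<]\K[A-Z<]{3}')
echo "$out"
```

The second semicolon on the line (before the date) does not match, because it is followed by a digit rather than an uppercase letter.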
Upvotes: 1