biolightning

Reputation: 61

Regex to return last 3 characters of matching pattern

I am using grep to search through text files containing 88-character MRZs (machine-readable zones). Within the text file, each MRZ is preceded by a semicolon. I only want to get characters 3-5 of the MRZ.

This is my pattern:

egrep --include *.txt -or . -e ";[A-Z][A-Z0-9<][A-Z<]{3}"

This is a textfile:

text is here;P<RUSIVAN<<DEL<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<F64D123456RUS7404124F131009734P41234<<<<<<<8  ;2019-02-08

This is my output:

;P<RUS

This is my desired output:

RUS

The semicolon introduces the MRZ. It starts with an uppercase letter, followed by either an uppercase letter, a digit or a filler character <. Then follows the 3-character country code, which can contain uppercase letters or filler characters <.

This pattern works fine, but I only want the last 3 characters, the ones matched by the {3} quantifier, returned. Is there a way to get only the last 3 characters of a matching pattern? In the sample text file the desired output would be RUS. Thank you!

Upvotes: 1

Views: 1229

Answers (2)

Ed Morton

Reputation: 204721

Is this all you're trying to do?

$ awk -F';' '{print substr($2,3,3)}' file
RUS

$ sed -E 's/[^;]*;..(.{3}).*/\1/' file
RUS

If not then edit your question to provide more truly representative sample input/output.

The UNIX command to find files is named find, btw, not grep. I know the GNU guys added a bunch of options for finding files to grep but just don't use them as they make your grep command unnecessarily complicated (and inconsistent with the other UNIX text processing tools) as it then needs arguments to find files as well as to g/re/p within the files. So your command line if you're using grep should be:

find . -name '*.txt' -exec grep 'stuff' {} +

not:

egrep --include *.txt -or . -e 'stuff'

and do the same for any other tool:

find . -name '*.txt' -exec grep 'stuff' {} +
find . -name '*.txt' -exec sed  'stuff' {} +
find . -name '*.txt' -exec awk  'stuff' {} +
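Putting the two parts of this answer together, a minimal sketch might look like the following. It assumes a sed that supports -E (GNU and BSD sed both do), and it tightens the sed expression to the question's character classes so that lines without an MRZ print nothing. The scratch directory and the file name sample.txt are made up for illustration.

```shell
# Hypothetical scratch directory and file name, for illustration only.
dir=$(mktemp -d)
printf '%s\n' 'text is here;P<RUSIVAN<<DEL<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<F64D123456RUS7404124F131009734P41234<<<<<<<8  ;2019-02-08' > "$dir/sample.txt"

# find selects the files; sed does the text processing.
# -n plus the trailing p prints only lines where the substitution succeeded,
# and \1 is the 3-character group that follows the first two MRZ characters.
codes=$(find "$dir" -name '*.txt' -exec sed -nE 's/^[^;]*;[A-Z][A-Z0-9<]([A-Z<]{3}).*/\1/p' {} +)
echo "$codes"   # prints RUS

rm -rf "$dir"
```

Note that this only looks at the first semicolon on each line, which matches the sample input in the question.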

Upvotes: 0

The fourth bird

Reputation: 163632

If you can use GNU grep, you can use the -P option (Perl-compatible regex) together with \K, which drops everything matched so far from the reported match, and then match your character class 3 times:

grep -roP --include='*.txt' ";[A-Z][A-Z0-9<]\K[A-Z<]{3}"
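If your grep was built without -P support, a rough fallback is to let grep -o print the whole 6-character match and then cut out characters 4-6. This is only a sketch of an alternative, not the \K approach itself; it assumes a grep that supports --include (GNU and BSD grep do), and the scratch directory and file name are made up for illustration.

```shell
# Hypothetical scratch directory and file name, for illustration only.
dir=$(mktemp -d)
printf '%s\n' 'text is here;P<RUSIVAN<<DEL<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<F64D123456RUS7404124F131009734P41234<<<<<<<8  ;2019-02-08' > "$dir/sample.txt"

# -o prints only the matched text (";P<RUS"), -h suppresses the file-name
# prefix that -r would otherwise add, and cut keeps characters 4-6 of each match.
code=$(grep -rhoE --include='*.txt' ';[A-Z][A-Z0-9<][A-Z<]{3}' "$dir" | cut -c4-6)
echo "$code"   # prints RUS

rm -rf "$dir"
```

The -h matters here: without it, the file name at the start of each output line would shift the character positions that cut sees.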

Upvotes: 1
