Dave Johnson

Reputation: 49

How to extract only specific strings from each line of a file using awk?

I was wondering if there is a generic way, using awk, to extract a specific string which by design is an eleven-character alphanumeric string. For example:

cat ext.txt

This is a sample field where the code is MGTCBEBEECL for NR
This is a sample field where the code is MGTCBEBEE01 for NR
This field must be 030 when Rule_1 = 'FR' and Rule_2  is 'EUROFRANSBI' or 'EURO_NEAR' and code is PARBFRPPXXX 
This field must be 0186 when Rule_1 = 'FR' and Rule_2  is 'EUROFRANSBI' or  'EURO_NEAR' and code is CITIFRPPXXX for the NR
For NFNC with Rule_1 is CA and Rule_2 is Universal and business code is null and official code must be 'CIBCCATTXXX'

I want to extract only the codes:

MGTCBEBEECL 
MGTCBEBEE01 
PARBFRPPXXX 
CITIFRPPXXX 
CIBCCATTXXX

There are almost 100 such lines from which I am hoping to extract these distinct strings, but I am at my wits' end as to how to make the extraction generic and non-redundant, hence I am seeking this community's assistance!

Upvotes: 0

Views: 1080

Answers (5)

RavinderSingh13

Reputation: 133428

We could use awk's match function. This was written and tested in GNU awk and should work in any awk that supports POSIX interval expressions such as {11}. The regex [[:alnum:]]{11} matches 11 consecutive alphanumeric characters in each line; if a match is found, the matched substring is printed using RSTART and RLENGTH.

awk  'match($0,/[[:alnum:]]{11}/){print substr($0,RSTART,RLENGTH)}' Input_file
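One caveat, offered as a sketch rather than as part of the answer above: match() reports the leftmost run of 11 alphanumerics, so on the two sample lines that contain 'EUROFRANSBI' it prints that word instead of the code. Assuming, as the sample suggests, that the code always follows "code is" or "code must be" (optionally inside single quotes), the search can be anchored on that phrase. The function name extract_codes is hypothetical, \047 is the octal escape for a single quote, and the length check stands in for {11} because some awk builds (older mawk releases, for example) do not support interval expressions.

```shell
# extract_codes: hypothetical helper; reads the named files, or stdin if none.
# Assumption: codes are upper-case letters/digits and follow "code is" or
# "code must be", optionally wrapped in single quotes (\047).
extract_codes() {
  awk '
    match($0, /code (is|must be) \047?[A-Z0-9]+/) {
      code = substr($0, RSTART, RLENGTH)          # e.g. "code is MGTCBEBEECL"
      sub(/^code (is|must be) \047?/, "", code)   # drop the anchoring phrase
      if (length(code) == 11) print code          # keep 11-character codes only
    }
  ' "$@"
}
```

Called as extract_codes ext.txt, this should print the five codes from the question, one per line.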

Upvotes: 0

Ed Morton

Reputation: 203149

Using any sed that has -E to enable EREs, e.g. GNU and BSD seds:

$ sed -En "s/.*code (is|must be) '?([[:upper:][:digit:]]+).*/\2/p" file
MGTCBEBEECL
MGTCBEBEE01
PARBFRPPXXX
CITIFRPPXXX
CIBCCATTXXX

Upvotes: 1

Carlos Pascual

Reputation: 1126

There is a way with GNU awk using FPAT:

awk -v FPAT='[[:alnum:]]{11}' '{print $NF}' file
MGTCBEBEECL
MGTCBEBEE01
PARBFRPPXXX
CITIFRPPXXX
CIBCCATTXXX
  • Setting FPAT to '[[:alnum:]]{11}' makes GNU awk treat every run of eleven alphanumeric characters as a field.
  • {print $NF} then prints the last such field on the line, which is where the code sits in the sample.
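FPAT is specific to GNU awk. As a rough portable sketch of the same idea (keep the last 11-character alphanumeric token on each line), the line can be split on non-alphanumeric characters and the token lengths checked. The name last_code is made up for illustration, and unlike FPAT this version only accepts tokens that are exactly 11 characters long and prints nothing for lines without one.

```shell
# last_code: hypothetical helper; reads the named files, or stdin if none.
last_code() {
  awk '{
    code = ""
    # Split on every run of characters that are not letters or digits
    n = split($0, tok, /[^A-Za-z0-9]+/)
    for (i = 1; i <= n; i++)
      if (length(tok[i]) == 11) code = tok[i]   # remember the last candidate
    if (code != "") print code
  }' "$@"
}
```

Usage would be last_code ext.txt. The approach relies on the code being the last 11-character token on the line, which holds for the sample but is an assumption about the remaining lines.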

Upvotes: 2

Thor

Reputation: 47089

With the current examples you can do it with grep like this:

<ext.txt grep -oE "(code is|code must be) '?[A-Z0-9]{11}'?" | 
tr -d "'"                                                   |
grep -o '[^ ]*$'

Output:

MGTCBEBEECL
MGTCBEBEE01
PARBFRPPXXX
CITIFRPPXXX
CIBCCATTXXX

Upvotes: 0

Luuk

Reputation: 14899

Using gawk:

gawk -F "[ ']" 'BEGIN{ r=@/[A-Z]{11}/ }r{ for (i=1; i<=NF;i++){ if($i~r) print $i} }' ext.txt
  • -F "[ ']" uses a space or ' as the field separator (to also find quoted codes like 'CIBCCATTXXX').
  • r=@/[A-Z]{11}/ assigns the regular expression to a variable (because it's used twice in the script).
  • for(... loops over all the fields in a line and prints each field that matches the regular expression.

output:

MGTCBEBEECL
EUROFRANSBI
PARBFRPPXXX
EUROFRANSBI
CITIFRPPXXX
CIBCCATTXXX
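The output above lacks MGTCBEBEE01 because [A-Z] does not match digits. Below is a minimal variant, assuming codes consist only of upper-case letters and digits, written for any POSIX awk rather than gawk; field_codes is a hypothetical name. It still prints EUROFRANSBI, since that value has the same shape as a code and a pattern alone cannot tell them apart.

```shell
# field_codes: hypothetical helper; reads the named files, or stdin if none.
field_codes() {
  awk -F "[ ']" '{
    for (i = 1; i <= NF; i++)
      # Keep fields that are exactly 11 characters, all upper-case letters or digits
      if (length($i) == 11 && $i !~ /[^A-Z0-9]/) print $i
  }' "$@"
}
```

Piping the result through sort -u would reduce it to distinct values if duplicates such as EUROFRANSBI are unwanted.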

Upvotes: 1
