Alan Mackey

Reputation: 105

AWK - pattern recognition

So I have this file, horribly formatted:

cat file1 

220914230708E2022091416195068167642220000039TE  OKaaaaaaaaaa          0000000017316354827094010600 aaaaaaaaaa               001                                  2022091416123467540807620001105TE  OKbbbbbbbbbb          0000000292934354487119918680 bbbbbbbbbb               001                                  2022091416483567002141420000731TE  OKDFDFJFSDFHS          0000000199137867325032383540                                                                2022091419204463543285020000412TE  OKcccccccccc          0000000111113867351043007780 cccccccccc               1EP                                  2022091419372363503707220000233TE  OKddddddddddd          0000000067822353828105648630 ddddddddddd               001

       

And I would like to make it more readable.

I noticed the first field always ends with "TE", so I tried this (and it almost worked):

awk ' BEGIN { RS = "TE" } { if ( $0 ~ "OK" ) print "TE" $0 }' file1

TE  OKaaaaaaaaaa          0000000017316354827094010600 aaaaaaaaaa               001                                  2022091416123467540807620001105
TE  OKbbbbbbbbbb          0000000292934354487119918680 bbbbbbbbbb               001                                  2022091416483567002141420000731
TE  OKDFDFJFSDFHS          0000000199137867325032383540                                                                2022091419204463543285020000412
TE  OKcccccccccc          0000000111113867351043007780 cccccccccc               1EP                                  2022091419372363503707220000233
TE  OKddddddddddd          0000000067822353828105648630 ddddddddddd               001  

 

This is what I'm trying to accomplish:

2022091416123467540807620001105TE               OKbbbbbbbbbb            0000000292934354487119918680    bbbbbbbbbb               001                                  
2022091416483567002141420000731TE               OKDFDFJFSDFHS           0000000199137867325032383540                                                                
2022091419204463543285020000412TE               OKcccccccccc            0000000111113867351043007780    cccccccccc               1EP                                  
2022091419372363503707220000233TE               OKddddddddddd           0000000067822353828105648630    ddddddddddd              001  
         

I know the problem is that I'm not correctly identifying the pattern, but I'm not sure what to do next.

Any help would be greatly appreciated.

Upvotes: 1

Views: 106

Answers (4)

RARE Kpop Manifesto

Reputation: 2807

If you're okay with two awk calls joined by a pipe, here's an awk-based solution that needs no custom functions, loops, arrays, END block, or vendor-proprietary extensions:

mawk 'gsub("[^ \t]+TE", "\n&") + sub("^\n[^E]+E",_)^_' | mawk NF=NF OFS='\t'  


2022091416195068167642220000039TE   OKaaaaaaaaaa    0000000017316354827094010600    aaaaaaaaaa  001
2022091416123467540807620001105TE   OKbbbbbbbbbb    0000000292934354487119918680    bbbbbbbbbb  001
2022091416483567002141420000731TE   OKDFDFJFSDFHS   0000000199137867325032383540
2022091419204463543285020000412TE   OKcccccccccc    0000000111113867351043007780    cccccccccc  1EP
2022091419372363503707220000233TE   OKddddddddddd   0000000067822353828105648630    ddddddddddd 001

Upvotes: 0

RavinderSingh13

Reputation: 133428

With your shown samples, please try the following awk code. It combines awk with column to print the values in an aligned tabular format.

awk '
BEGIN{OFS="\t"}
{ $1=$1 }
1
' <(awk '{gsub(/[^[:space:]]+TE[[:space:]]+/,"\n&");sub(/^\n/,"")} 1' Input_file) | 
column -t -s $'\t'

OR, to print only the lines that contain the string OK, try:

awk '
BEGIN{OFS="\t"}
/OK/{ $1=$1; print}
' <(awk '{gsub(/[^[:space:]]+TE[[:space:]]+/,"\n&");sub(/^\n/,"")} 1' Input_file) | 
column -t -s $'\t'
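
A side note on the $1=$1 step used in both variants: assigning to a field makes awk rebuild the record with OFS between the fields, which is what turns the runs of spaces into single tabs before column aligns them. A minimal sketch of just that step, where the function name retab is made up for illustration and is not part of the answer:

```shell
# retab: hypothetical helper name, used only for this illustration.
# Reads lines on stdin and rejoins their whitespace-separated fields with tabs.
retab() {
  # Assigning to $1 forces awk to rebuild $0 using OFS; the trailing 1 prints it.
  awk -v OFS='\t' '{ $1 = $1 } 1'
}

# Example: three fields separated by uneven padding come out tab-separated.
printf 'TE   OKaaaa      001\n' | retab
```

Because an empty field in the middle of a line leaves no trace after this rebuild, the short DFDFJFSDFHS row simply ends up with fewer columns.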

Upvotes: 3

glenn jackman

Reputation: 246744

I like @Daweo's answer best. An iterative solution that would work with any awk:

awk '
  function emit() {
    if (line ~ /OK/) print line
    line = ""
  }
  {
    for (i=1; i<=NF; i++) {
      if (i > 1 && $i ~ /TE$/) emit()
      line = line $i " "
    }
    emit()
  }
' file | column -t

Upvotes: 2

Daweo

Reputation: 36360

I would use GNU AWK for this task in the following way. Let file.txt content be

220914230708E2022091416195068167642220000039TE  OKaaaaaaaaaa          0000000017316354827094010600 aaaaaaaaaa               001                                  2022091416123467540807620001105TE  OKbbbbbbbbbb          0000000292934354487119918680 bbbbbbbbbb               001                                  2022091416483567002141420000731TE  OKDFDFJFSDFHS          0000000199137867325032383540                                                                2022091419204463543285020000412TE  OKcccccccccc          0000000111113867351043007780 cccccccccc               1EP                                  2022091419372363503707220000233TE  OKddddddddddd          0000000067822353828105648630 ddddddddddd               001

then

awk 'BEGIN{RS="[0-9]*TE"}/OK/{print prev, $0}{prev=RT}' file.txt

gives output

2022091416195068167642220000039TE   OKaaaaaaaaaa          0000000017316354827094010600 aaaaaaaaaa               001                                  
2022091416123467540807620001105TE   OKbbbbbbbbbb          0000000292934354487119918680 bbbbbbbbbb               001                                  
2022091416483567002141420000731TE   OKDFDFJFSDFHS          0000000199137867325032383540                                                                
2022091419204463543285020000412TE   OKcccccccccc          0000000111113867351043007780 cccccccccc               1EP                                  
2022091419372363503707220000233TE   OKddddddddddd          0000000067822353828105648630 ddddddddddd               001

Explanation: I tell GNU AWK that the record separator (RS) is zero or more (*) digits ([0-9]) followed by TE. For each record containing OK, I print prev and the current record ($0), where prev holds the record terminator (RT) of the previous record; prev is updated after the print. Disclaimer: unlike the desired output, my output includes the line with OKaaaaaaaaaa, as I do not know the logic behind its exclusion; feel free to adjust the condition of the printing action to take this into account.

(tested in gawk 4.2.1)
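
RT is a gawk extension, so for other awks roughly the same split can be sketched with match(), which is POSIX. This is only a sketch under assumptions: the function name split_records is invented here, every record key is assumed to be a run of at least one digit followed by TE, and trailing padding is trimmed from each record.

```shell
# split_records: hypothetical helper name; reads the long input line on stdin.
split_records() {
  awk '{
    s = $0
    # Each key is a run of digits ending in TE; its body runs up to the next key.
    while (match(s, /[0-9]+TE/)) {
      key = substr(s, RSTART, RLENGTH)
      s = substr(s, RSTART + RLENGTH)
      body = match(s, /[0-9]+TE/) ? substr(s, 1, RSTART - 1) : s
      sub(/[ \t]+$/, "", body)          # drop trailing padding
      if (body ~ /OK/) print key body   # same OK filter as above
    }
  }'
}

# Example with a shortened, made-up line in the same shape as file.txt.
printf '%s\n' '9E111TE  OKaaa  001   222TE  OKbbb  002' | split_records
```

With the real data it would be called as split_records < file.txt, and the result can be piped through column -t if aligned columns are wanted.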

Upvotes: 3
