Reputation: 105
So I have this file, horribly formatted:
cat file1
220914230708E2022091416195068167642220000039TE OKaaaaaaaaaa 0000000017316354827094010600 aaaaaaaaaa 001 2022091416123467540807620001105TE OKbbbbbbbbbb 0000000292934354487119918680 bbbbbbbbbb 001 2022091416483567002141420000731TE OKDFDFJFSDFHS 0000000199137867325032383540 2022091419204463543285020000412TE OKcccccccccc 0000000111113867351043007780 cccccccccc 1EP 2022091419372363503707220000233TE OKddddddddddd 0000000067822353828105648630 ddddddddddd 001
And I would like to make it more readable.
I noticed the first field always ends with "TE", so I tried this (and it almost worked):
awk ' BEGIN { RS = "TE" } { if ( $0 ~ "OK" ) print "TE" $0 }' file1
TE OKaaaaaaaaaa 0000000017316354827094010600 aaaaaaaaaa 001 2022091416123467540807620001105
TE OKbbbbbbbbbb 0000000292934354487119918680 bbbbbbbbbb 001 2022091416483567002141420000731
TE OKDFDFJFSDFHS 0000000199137867325032383540 2022091419204463543285020000412
TE OKcccccccccc 0000000111113867351043007780 cccccccccc 1EP 2022091419372363503707220000233
TE OKddddddddddd 0000000067822353828105648630 ddddddddddd 001
This is what I'm trying to accomplish:
2022091416123467540807620001105TE OKbbbbbbbbbb 0000000292934354487119918680 bbbbbbbbbb 001
2022091416483567002141420000731TE OKDFDFJFSDFHS 0000000199137867325032383540
2022091419204463543285020000412TE OKcccccccccc 0000000111113867351043007780 cccccccccc 1EP
2022091419372363503707220000233TE OKddddddddddd 0000000067822353828105648630 ddddddddddd 001
I know the problem is that I'm not correctly identifying the pattern, but I'm not sure what to do next.
Any help would be greatly appreciated.
Upvotes: 1
Views: 106
Reputation: 2807
If you're okay with 2 awk calls with a pipe, then here's an awk-based solution without needing custom functions, loops, arrays, an END block, or vendor-proprietary solutions:
mawk 'gsub("[^ \t]+TE", "\n&") + sub("^\n[^E]+E",_)^_' | mawk NF=NF OFS='\t'
2022091416195068167642220000039TE OKaaaaaaaaaa 0000000017316354827094010600 aaaaaaaaaa 001
2022091416123467540807620001105TE OKbbbbbbbbbb 0000000292934354487119918680 bbbbbbbbbb 001
2022091416483567002141420000731TE OKDFDFJFSDFHS 0000000199137867325032383540
2022091419204463543285020000412TE OKcccccccccc 0000000111113867351043007780 cccccccccc 1EP
2022091419372363503707220000233TE OKddddddddddd 0000000067822353828105648630 ddddddddddd 001
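For readers who find the one-liner above hard to parse, here is a rough expansion of what it appears to do, written for any POSIX awk. The function name split_records is made up for illustration. The sketch assumes, as in the sample, that the junk before the first timestamp ends with an "E". In the original, `_` is an unset variable, so it acts as an empty replacement string in sub() and as the exponent 0 in `^_`, which makes the whole pattern always true and triggers the default print.

```shell
# split_records is a hypothetical helper name; it reads the one-line input on stdin.
split_records() {
  awk '{
    # Put a newline in front of every run of non-blank characters ending in "TE".
    gsub(/[^ \t]+TE/, "\n&")
    # Drop the leading newline plus the junk up to and including the first "E".
    sub(/^\n[^E]+E/, "")
    print
  }' |
  # Second pass: rebuild each line so its fields are separated by tabs.
  awk -v OFS='\t' '{ $1 = $1; print }'
}
```

It could be called as `split_records < file1`.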
Upvotes: 0
Reputation: 133428
With your shown samples, please try the following awk code. It uses an awk + column combination to print the values in an aligned tabular format.
awk '
BEGIN{OFS="\t"}
{ $1=$1 }
1
' <(awk '{gsub(/[^[:space:]]+TE[[:space:]]+/,"\n&");sub(/^\n/,"")} 1' Input_file) |
column -t -s $'\t'
Or, to print only the lines that contain the string OK, try:
awk '
BEGIN{OFS="\t"}
/OK/{ $1=$1; print}
' <(awk '{gsub(/[^[:space:]]+TE[[:space:]]+/,"\n&");sub(/^\n/,"")} 1' Input_file) |
column -t -s $'\t'
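The `<(...)` process substitution used above is a bash/ksh/zsh feature and is not available in plain POSIX sh. As a hedged alternative, the same idea can be sketched in a single awk pass by splitting the modified line on the inserted newlines. The name one_pass is hypothetical, and like the code above this keeps the leading junk on the first record.

```shell
# one_pass is a hypothetical helper name; it reads the one-line input on stdin.
one_pass() {
  awk 'BEGIN { OFS = "\t" }
  {
    gsub(/[^ \t]+TE[ \t]+/, "\n&")   # newline before each token that ends in TE
    sub(/^\n/, "")                    # avoid an empty first piece
    n = split($0, rec, "\n")          # one array element per output line
    for (i = 1; i <= n; i++) {
      $0 = rec[i]                     # re-split this piece into fields
      if ($0 !~ /OK/) continue        # keep only pieces that contain OK
      $1 = $1                         # rebuild the record with OFS
      print
    }
  }'
}
```

The result is tab-separated, so it can still be piped to `column -t -s "$(printf '\t')"` for alignment.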
Upvotes: 3
Reputation: 246744
I like @Daweo's answer best. An iterative solution that would work with any awk:
awk '
  function emit() {
    if (line ~ /OK/) print line
    line = ""
  }
  {
    for (i=1; i<=NF; i++) {
      if (i > 1 && $i ~ /TE$/) emit()
      line = line $i " "
    }
    emit()
  }
' file | column -t
Upvotes: 2
Reputation: 36360
I would harness GNU AWK for this task in the following way. Let file.txt content be
220914230708E2022091416195068167642220000039TE OKaaaaaaaaaa 0000000017316354827094010600 aaaaaaaaaa 001 2022091416123467540807620001105TE OKbbbbbbbbbb 0000000292934354487119918680 bbbbbbbbbb 001 2022091416483567002141420000731TE OKDFDFJFSDFHS 0000000199137867325032383540 2022091419204463543285020000412TE OKcccccccccc 0000000111113867351043007780 cccccccccc 1EP 2022091419372363503707220000233TE OKddddddddddd 0000000067822353828105648630 ddddddddddd 001
then
awk 'BEGIN{RS="[0-9]*TE"}/OK/{print prev, $0}{prev=RT}' file.txt
gives output
2022091416195068167642220000039TE OKaaaaaaaaaa 0000000017316354827094010600 aaaaaaaaaa 001
2022091416123467540807620001105TE OKbbbbbbbbbb 0000000292934354487119918680 bbbbbbbbbb 001
2022091416483567002141420000731TE OKDFDFJFSDFHS 0000000199137867325032383540
2022091419204463543285020000412TE OKcccccccccc 0000000111113867351043007780 cccccccccc 1EP
2022091419372363503707220000233TE OKddddddddddd 0000000067822353828105648630 ddddddddddd 001
Explanation: I inform GNU AWK that the record separator (RS) is zero or more (*) digits ([0-9]) followed by TE. Then, for each record containing OK, I print prev and the current record ($0), where prev holds the record terminator (RT) of the previous record; it is set after said printing. Disclaimer: unlike the desired output, my output includes the line with OKaaaaaaaaaa, as I do not know the logic behind its exclusion; feel free to adjust the condition of the printing action to take this into account.
(tested in gawk 4.2.1)
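RT is a GNU AWK extension, and POSIX leaves the behavior of a multi-character RS unspecified, so the command above may not work elsewhere. A rough sketch of the same idea for other awks, using match() to find each separator by hand, could look like the following. The names split_on_te and emit are made up for illustration, and unlike the gawk output this version trims trailing blanks and joins the separator and the record without an extra space.

```shell
# split_on_te is a hypothetical helper name; it reads the one-line input on stdin.
split_on_te() {
  awk '
    # Print separator + body when the body contains OK, minus trailing blanks.
    function emit(sep, body) {
      sub(/[ \t]+$/, "", body)
      if (body ~ /OK/) print sep body
    }
    {
      rest = $0
      prev = ""                                  # separator seen before the current chunk
      while (match(rest, /[0-9]*TE/)) {
        emit(prev, substr(rest, 1, RSTART - 1))  # text before this separator
        prev = substr(rest, RSTART, RLENGTH)     # plays the role of RT
        rest = substr(rest, RSTART + RLENGTH)    # continue after the separator
      }
      emit(prev, rest)                           # last chunk has no following separator
    }
  '
}
```

It could be called as `split_on_te < file.txt`.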
Upvotes: 3