Reputation: 177
I am a new GAWK user and have been able to figure out how to match patterns from several columns out of a 100 column report, but would like to learn how to simplify the code instead of appending the results of individual GAWK commands.
My current problem was to strip out header information, plus all instances of '/' characters in a variety of specific columns.
This reduced compound code example works:
gawk -F"|" '$1 ~ "Value" {print $1" | "$8" | "$51" | "$55}' "Q:\Report.txt" > c:\temp\Matches.txt && gawk -F"|" '$8 ~ "/" {print $1" | "$8" | "$51" | "$55}' "Q:\Report.txt" >> c:\temp\Matches.txt && gawk -F"|" '$51 ~ "/" {print $1" | "$8" | "$51" | "$55}' "Q:\Report.txt" >> c:\temp\Matches.txt && gawk -F"|" '$55 ~ "/" {print $1" | "$8" | "$51" | "$55}' "Q:\Report.txt" >> c:\temp\Matches.txt
Given my limited understanding of how to apply AWK expressions and syntax, my attempts to simplify the above code (using multiple examples from various sources) have failed outright or in part.
The following code hangs:
gawk -F"|" '$1 ~ "Value" || $8 ~ "/" || $51 ~ "/" || $55 ~ "/" {print $1" | "$8" | "$51" | "$55}' "Q:\Report.txt"
After cancelling the process the following (and similar) error messages are produced:
'$8' is not recognized as an internal or external command, operable program or batch file.
The following code finds the appropriate matches, but it prints all 100 columns from my source data instead of just the ones noted:
gawk -F"|" '$1 ~ "Value";$8 ~ "/";$51 ~ "/";$55 ~ "/" {print $1" | "$8" | "$51" | "$55}' "Q:\Report.txt" > c:\temp\Matches.txt
In addition to any suggestions on simplifying the above code, what GAWK resources or books have the most useful code examples?
Much thanks for any assistance.
Upvotes: 0
Views: 78
Reputation: 2807
Since the four columns you're searching are also the ones you need in the output, you can trim the row down to them first and then apply a single unified regex:
mawk '(NF =($2=$8 substr(_, $3=$51, $4=$58))^_+3)*/^[^|]*Value|\|.*\//' OFS=' | '
Value1 | /8 | /51 | /58
The substr() is just a placeholder structure; it doesn't corrupt the data contained within $8. Multiplying the first half by the regex ensures that the regex must evaluate to true before the new synthetic row is printed.
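For readers who find the one-liner above hard to follow, here is a more readable sketch of the same idea (trim the row first, then test it with one regex). It is not the answerer's exact code. It assumes a POSIX shell rather than CMD, and fields 1-4 of the made-up sample data are hypothetical stand-ins for $1, $8, $51 and $55.

```shell
# Sample rows; fields 1-4 are hypothetical stand-ins for $1, $8, $51 and $55.
rows=$(printf '%s\n' 'Value|a|b|c' 'x|1/2|b|c' 'y|a|b|c' 'z|a|b|3/4')

# Build the trimmed row once, then apply a single regex to it:
# "Value" in the first field, or a "/" anywhere after the first separator.
trimmed=$(printf '%s\n' "$rows" | awk -F'|' -v OFS=' | ' '
  { row = $1 OFS $2 OFS $3 OFS $4 }
  row ~ /^[^|]*Value|\|.*\// { print row }
')
printf '%s\n' "$trimmed"
```

Building the row in a variable instead of assigning to NF keeps the original record untouched, at the cost of spelling out the field list.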
Upvotes: 0
Reputation: 38771
You are apparently running on Windows CMD although you didn't say so. CMD does not implement quoting the same way as Unix shells do and in particular doesn't handle single-quote ' ... ' in the way you need. C implementations for Windows, and therefore ports of Unix-based C programs like gawk to such implementations, usually try to approximate Unix shell command-line handling as closely as they can, but it's not sufficient for this case.
If you are on a supported version of Windows (8 or later), the simplest solution is to use PowerShell instead. (This is also the Microsoft-recommended method, FWTW.) PowerShell even at the lexical level relevant here does have some significant differences from Unix -- especially backquote/backtick -- but it should be close enough to handle this case. (At the higher 'cmdlet' level it is quite different, being based on structured data not text.)
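An alternative solution, arguably more programmy and thus ontopic, is to use the fact that awk, like C, treats booleans as the integers 0 and 1, and to use arithmetic addition instead of ||, which CMD interprets as its own operator (hence the "'$8' is not recognized" error):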
gawk -F"|" '($1 ~ "Value")+($8 ~ "/")+($51 ~ "/")+($55 ~ "/") {print $1" | "$8" | "$51" | "$55}' "Q:\Report.txt"
PS: I would also use OFS rather than retyping " | ":
-vOFS=" | " '...{print $1,$8,$51,$55}'
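Putting the two suggestions together, a minimal sketch might look like the following. Keeping the program in a file and running it with -f is my own addition, not something proposed above; it sidesteps command-line quoting because the shell never sees the awk code. The sketch assumes a POSIX shell, plain `awk` (gawk accepts the same program), a hypothetical file name `match.awk`, and made-up sample data in which fields 1-4 stand in for $1, $8, $51 and $55.

```shell
# Tiny pipe-delimited sample; fields 1-4 stand in for $1, $8, $51, $55.
tmp=$(mktemp -d)
printf '%s\n' 'Value|a|b|c' 'x|1/2|b|c' 'y|a|b|c' 'z|a|b|3/4' > "$tmp/report.txt"

# The program lives in a file (hypothetical name), so no shell quoting applies.
cat > "$tmp/match.awk" <<'EOF'
BEGIN { FS = "|"; OFS = " | " }
# Each match yields 0 or 1, so a nonzero sum means at least one test matched.
($1 ~ "Value") + ($2 ~ "/") + ($3 ~ "/") + ($4 ~ "/") { print $1, $2, $3, $4 }
EOF

result=$(awk -f "$tmp/match.awk" "$tmp/report.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the real report, the field numbers in the program file would go back to $8, $51 and $55, and the same file should be usable from CMD as `gawk -f match.awk "Q:\Report.txt"`.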
Upvotes: 2