Hafiz Muhammad Shafiq

Reputation: 8670

AWK FPAT not working as expected for string parsing

I have to parse a very long string read from stdin. It is basically a .sql file, and I need to extract the data from it so that I can convert it to CSV. I am using awk for this. A sample snippet (two records) is as follows:

b="([email protected],www.example.com,'field2,(2)'),([email protected],www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'

In my regex, I am saying: split on the ")" bracket, but if single quotes are found, ignore all text up to the closing quote. But my output is as follows:

([email protected],www.example.com,'field2,(2

I am expecting this output:

([email protected],www.example.com,'field2,(2)'

Where is the problem in my code? I have searched a lot and checked the awk manual, but without success.

Upvotes: 0

Views: 523

Answers (4)

RARE Kpop Manifesto

Reputation: 2815

if you wanna do it close to one pass, maybe try this

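{mawk/mawk2/gawk} 'BEGIN {
        OFS = FS = "\047"          # split each row on single quotes
        ORS = RS = "\n"
        XFS = "\376\004\377"       # sentinel unlikely to occur in the data
        XRS = "\051" ORS           # ")" followed by a newline
    }
    ! /[\051]/ { print; next }     # no ")" at all: nothing to split
    {
        for (x = 1; x <= NF; x += 2)    # odd fields lie outside the quotes
            gsub(/[\051][^\050]*/, XFS, $(x))
    }
    gsub(XFS, XRS) || 1'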

I did it this way, with 2 gsubs, in case pushing rows onto new lines directly has unintended consequences. \051 is ")" and \050 is the opening "(".

  • further enhanced it by telling it to instantly print and move on if no close brackets are even found (so nothing to split at all)

It only loops over odd-numbered fields once i split it by the single quote \047 (cuz even numbered ones are precisely the ones within a pair of single quotes you want to avoid chopping at).

As for XFS, just pick any combination of your choice using bytes that are almost impossible to encounter. If you want to play it safe, you can test whether XFS exists in that row and use some alternative combo if it does. It's basically there to insert a delimiter into the middle of the row that wouldn't run afoul of the actual input data. It's not foolproof per se, but the likelihood of running into a combination of UTF-16 byte order mark bytes and an ASCII control character is reasonably low.

(and if you encounter XFS, it's likely you already have corrupted data to begin with, since a 300 series octal must be followed by 200 series ones to be valid UTF8)
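
A minimal sketch of that safety check, assuming two hypothetical candidate sentinels (\001 and \002) and a made-up shell function name, pick_and_split. It is not the one-liner above, just an illustration of testing for the sentinel with index() before relying on it:

```shell
# Sample input from the question.
b="([email protected],www.example.com,'field2,(2)'),([email protected],www.example.com,'field0'),"

# pick_and_split is a hypothetical helper name, not part of any tool.
pick_and_split() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        # Assumed candidate sentinels, tried in order.
        cand[1] = "\001"; cand[2] = "\002"; ncand = 2
    }
    {
        # Use the first candidate that does not occur in this row.
        sentinel = ""
        for (k = 1; k <= ncand; k++)
            if (index($0, cand[k]) == 0) { sentinel = cand[k]; break }
        if (sentinel == "") {
            print "row " NR ": no unused sentinel, skipped"
            next
        }
        # Odd-numbered pieces of a split on \047 lie outside the quotes.
        n = split($0, part, "\047")
        row = ""
        for (i = 1; i <= n; i++) {
            if (i % 2 == 1) gsub(/[)],/, ")" sentinel, part[i])
            sep = (i > 1) ? "\047" : ""
            row = row sep part[i]
        }
        # Every sentinel now marks a record boundary.
        m = split(row, rec, sentinel)
        for (j = 1; j <= m; j++) if (rec[j] != "") print rec[j]
    }'
}

pick_and_split "$b"
```

With the sample row this should print the two (...) records on separate lines; a row that already contains every candidate is reported and skipped rather than split wrongly.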

This way, i wouldn't need FPAT at all.

*updated with " || 1" towards the end as a safety catch-all, but shouldn't really be needed.

Upvotes: 1

anubhava

Reputation: 784948

A similar regex approach to the one Ed suggested, but I usually prefer using RS and RT over FPAT:

b="([email protected],www.example.com,'field2,(2)'),([email protected],www.example.com,'field0'),"
awk -v RS="[(]('[^']*'|[^)])*[)]" 'RT {print RT}' <<< "$b"
([email protected],www.example.com,'field2,(2)')
([email protected],www.example.com,'field0')

Upvotes: 1

Ed Morton

Reputation: 203209

My first answer below was wrong; there is an ERE for what you're trying to do:

$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
([email protected],www.example.com,'field2,(2)')
([email protected],www.example.com,'field0')
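
To see why the FPAT in the question stops early: its first alternative, [^)]+, can already match at the start of the line, and it knows nothing about quotes, so the match simply runs up to the first ). A small check of that claim, using match() so that it should run in any POSIX awk (first_alternative is a hypothetical function name used only for this illustration):

```shell
# Sample input from the question.
b="([email protected],www.example.com,'field2,(2)'),([email protected],www.example.com,'field0'),"

# first_alternative is a hypothetical helper name.
first_alternative() {
    printf '%s\n' "$1" | awk '{
        # [^)]+ is the first alternative of the FPAT from the question.
        if (match($0, /[^)]+/))
            print substr($0, RSTART, RLENGTH)
    }'
}

first_alternative "$b"
```

The printed text ends at "(2", the same truncated value the question reports for $1.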

Original answer, left as a different approach:

You need a 2-pass approach: first replace all )s within quoted fields with something that can't already exist in the input (e.g. RS), then identify the (...) fields and turn the RSs back into )s before printing them:

$ echo "$b" |
awk -F"'" -v OFS= '
    {
        for (i=2; i<=NF; i+=2) {
            gsub(/)/,RS,$i)
            $i = FS $i FS
        }
        FPAT = "[(][^)]*)"
        $0 = $0
        for (i=1; i<=NF; i++) {
            gsub(RS,")",$i)
            print $i
        }
        FS = FS
    }
'
([email protected],www.example.com,'field2,(2)')
([email protected],www.example.com,'field0')

The above is gawk-only due to FPAT (or we could have used gawk patsplit()); with other awks you'd use a while-match()-substr() loop:

$ echo "$b" |
awk -F"'" -v OFS= '
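    {
        # Pass 1: hide every ")" inside a quoted (even-numbered) field
        for (i=2; i<=NF; i+=2) {
            gsub(/[)]/,RS,$i)
            $i = FS $i FS
        }
        # Pass 2: peel off one (...) record at a time
        while ( match($0,/[(][^)]*[)]/) ) {
            field = substr($0,RSTART,RLENGTH)
            gsub(RS,")",field)
            print field
            $0 = substr($0,RSTART+RLENGTH)
        }
    }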
'
([email protected],www.example.com,'field2,(2)')
([email protected],www.example.com,'field0')
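
The single ERE from the top of this answer might also be combined with the match() loop, which would avoid the two passes in awks without FPAT. A sketch under a few assumptions: the closing parenthesis is written as [)] because an unmatched ) in an ERE is not portable, the quoted alternative is listed first, the regex is passed in through a variable named re to keep the single quotes out of the awk program text, and extract_records is a hypothetical function name:

```shell
# Sample input from the question.
b="([email protected],www.example.com,'field2,(2)'),([email protected],www.example.com,'field0'),"

# extract_records is a hypothetical helper name.
extract_records() {
    printf '%s\n' "$1" |
    awk -v re="[(]('[^']*'|[^)])*[)]" '{
        rest = $0
        # Peel off one (...) record per iteration.
        while (match(rest, re)) {
            print substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

extract_records "$b"
```

Working on a copy (rest) rather than on $0 avoids re-splitting the record on every iteration.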

Upvotes: 3

RavinderSingh13

Reputation: 133428

Written and tested with your shown samples in GNU awk. This can be done with a simple field separator setting; try the following, where b is your shell variable holding your shown value.

echo "$b" | awk -F'\\),\\(' '{print $1}'
([email protected],www.example.com,'field2,(2)'

Explanation: Simply set the field separator of the awk program to \\),\\( for your input and print the first field.
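
To go from the first field to every record, the same separator could be looped over, trimming the leading ( and the trailing ), that the separator leaves on the first and last fields. This is a sketch only: the separator is written with bracket expressions, [)],[(], which should match the same ),( text as above; strip_records is a hypothetical function name; and it assumes no quoted value ever contains ),( itself, since this separator takes no account of quotes:

```shell
# Sample input from the question.
b="([email protected],www.example.com,'field2,(2)'),([email protected],www.example.com,'field0'),"

# strip_records is a hypothetical helper name.
strip_records() {
    printf '%s\n' "$1" | awk -F'[)],[(]' '{
        sub(/^[(]/, "", $1)        # drop the opening "(" of the first record
        sub(/[)],?$/, "", $NF)     # drop the closing ")" or ")," of the last one
        for (i = 1; i <= NF; i++) print $i
    }'
}

strip_records "$b"
```

Each record then comes out on its own line without the surrounding parentheses, which is a convenient starting point for the CSV conversion.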

Upvotes: 2
