Dan Dyer

Reputation: 54475

Using Awk to process a file where each record has different fixed-width fields

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several record types, and each type has its own set of fixed-width fields (there is no field separator character). The first two characters of a record indicate its type, which in turn tells you which fields follow. A file might look something like this:

AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99

With Gawk I can set FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting it on a record-by-record basis). Alternatively, I can set FS to "" and process the file one character at a time, but that is a bit cumbersome.

Is there a good way to extract the fields from such a file using Awk?

Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.

Upvotes: 4

Views: 3995

Answers (7)

markp-fuso

Reputation: 34244

One awk idea (it needs GNU awk, since FIELDWIDTHS is a gawk extension) using an array to keep track of the different FIELDWIDTHS formats:

awk '
BEGIN { fw["AA"] = "2 6 6 12"                     # predefined FIELDWIDTHS
        fw["BB"] = "2 6 6  6 18 6"
        fw["CC"] = "2 7"
      }
      { FIELDWIDTHS = fw[substr($0,1,2)]          # dynamically define FIELDWIDTHS based on 1st two characters
        $0 = $0                                   # force reparse of input line based on new FIELDWIDTHS
        print "#############",$0
        for (i=1;i<=NF;i++)
            print "field #"i,":",$i
      }
' input.txt

This generates:

############# AAField1Field2LongerField3
field #1 : AA
field #2 : Field1
field #3 : Field2
field #4 : LongerField3
############# BBField4Field5Field6VeryVeryLongField7Field8
field #1 : BB
field #2 : Field4
field #3 : Field5
field #4 : Field6
field #5 : VeryVeryLongField7
field #6 : Field8
############# CCField99
field #1 : CC
field #2 : Field99

Upvotes: 0

Rob Wells

Reputation: 37103

Could you use Perl and then select an unpack template based on the first two chars of the line?

Upvotes: 3

Jonathan Leffler

Reputation: 753675

You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:

awk '/^AA/ { manually process record AA out of $0 }
     /^BB/ { manually process record BB out of $0 }
     /^CC/ { manually process record CC out of $0 }' file ...

The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.

I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
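A minimal sketch of what that manual processing might look like, assuming the sample layout from the question (AA: 6, 6, 12 characters after the type; BB: 6, 6, 6, 18, 6; CC: 7). The shell function name `extract_fields` and the "|" output delimiter are illustrative choices, not part of the original answer. It uses only `substr`, so it should not depend on gawk:

```shell
# Hypothetical helper: one awk rule per record type, one substr() per field.
# The offsets assume the sample layout shown in the question.
extract_fields() {
  awk -v OFS='|' '
    /^AA/ { print "AA", substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 12) }
    /^BB/ { print "BB", substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 6),
                        substr($0, 21, 18), substr($0, 39, 6) }
    /^CC/ { print "CC", substr($0, 3, 7) }
  ' "$@"
}
```

Run as `extract_fields file`, this should print each record with its fields separated by "|", for example `AA|Field1|Field2|LongerField3` for the first sample line.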

Upvotes: 4

Darren Atkinson

Reputation: 164

Hopefully this will point you in the right direction. Assuming each multi-line record is guaranteed to be terminated by a 'CC' type row, you can pre-process your text file using simple if-then logic. I have presumed you require fields 1, 5 and 7 on one row; a sample awk script would be:

BEGIN {
        field1=""
        field5=""
        field7=""
}
{
    record_type = substr($0,1,2)
    if (record_type == "AA")
    {
        field1=substr($0,3,6)
    }
    else if (record_type == "BB")
    {
        field5=substr($0,9,6)
        field7=substr($0,21,18)
    }
    else if (record_type == "CC")
    {
        print field1"|"field5"|"field7
    }
}

Create an awk script file called program.awk, pop that code into it, and execute the script using:

awk -f program.awk < my_multi_line_file.txt 

Upvotes: 8

Aleksey Otrubennikov

Reputation: 1181

You could perhaps use two passes:

1step.awk

/^AA/{printf "2 6 6 12"    }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 7"         }
{printf "\n%s\n", $0}

2step.awk

NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}

And then run:

awk -f 1step.awk sample  | awk -f 2step.awk

Upvotes: 5

Zsolt Botykai

Reputation: 51593

What about two scripts? For example, the first script inserts field separators based on the first two characters, and the second then processes the result.

Alternatively, define a function in your AWK script that splits each line into variables based on its record type. I would go this way, because the function can be reused.
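One way the function idea could be sketched, assuming the field widths implied by the sample in the question. The names `split_fixed` and `split_by_type`, the width table `fw`, and the "|" output delimiter are all hypothetical, chosen here for illustration. The function relies only on `split` and `substr`, so it should not require gawk:

```shell
# Hypothetical wrapper around an awk program with a reusable splitter function.
split_by_type() {
  awk '
    # Split "line" into out[1..n] using a space-separated list of widths;
    # returns the number of fields. Names after the gap are local variables.
    function split_fixed(line, widths, out,    w, n, i, pos) {
      n = split(widths, w, " ")
      pos = 1
      for (i = 1; i <= n; i++) {
        out[i] = substr(line, pos, w[i])
        pos += w[i]
      }
      return n
    }
    BEGIN {
      # Widths per record type, assumed from the sample data.
      fw["AA"] = "2 6 6 12"
      fw["BB"] = "2 6 6 6 18 6"
      fw["CC"] = "2 7"
    }
    {
      type = substr($0, 1, 2)
      if (!(type in fw)) next          # skip unknown record types
      n = split_fixed($0, fw[type], f)
      for (i = 1; i <= n; i++)
        printf "%s%s", f[i], (i < n ? "|" : "\n")
    }
  ' "$@"
}
```

Adding a record type then only means adding one entry to the width table; the main rule can work with `f[1]` to `f[n]` instead of printing them.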

Upvotes: 0

Petar Kabashki

Reputation: 6628

Better to use a fully featured scripting language such as Perl or Ruby.

Upvotes: 0
