Gifter Villanueva
Gifter Villanueva

Reputation: 149

comparing text matches in awk in different columns

I would like to compare two text strings if they are correct and matching provided that they are in 2 different columns in a csv file. I have the values

"DB Instance Identifier","Engine","Instance Type","MultiAZ","Reservation ID","State","Start Time","Reserved Engine","Reserved DB Instance Type"
"alpha-db-dev","postgres","db.t4g.micro",false,"alpha-db-dev-26aug2022","active","2022-08-26","postgresql","db.t4g.micro"
"alpha-db-prod","postgres","db.t4g.small",true,"alpha-db-prod-26aug2022","active","2022-08-26","postgresql","db.t4g.small"
"beta-db-dev-primary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-primary-07sep2022","active","2022-09-07","postgresql","db.t3.medium"
"beta-db-dev-secondary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-secondary-07sep2022","active","2022-09-07","postgresql","db.t3.small"
"charlie-db-prod","mysql","db.t3.small",true,"charlie-db-prod-07sep2022","active","2022-09-07","sql","db.t4g.medium"

in the example, the first row is a header where I would like to add the term "Engine match?" and "Instance Type Match?" which would be columns 10 and 11.

Then I would like the second row and succeeding ones to check if the value in column 2 and 8 matches. In the 2nd row and print true in column 10 if it does, and false if not.

I would also like to check if the values in column 3 and 9 are matching following the same logic above but print the result in column 11.

Desired Output:

"DB Instance Identifier","Engine","Instance Type","MultiAZ","Reservation ID","State","Start Time","Reserved Engine","Reserved DB Instance Class","Engine match?","Instance Type Match?"
"alpha-db-dev","postgres","db.t4g.micro",false,"alpha-db-dev-26aug2022","active","2022-08-26","postgresql","db.t4g.micro","true","true"
"alpha-db-prod","postgres","db.t4g.small",true,"alpha-db-prod-26aug2022","active","2022-08-26","postgresql","db.t4g.small","true","true"
"beta-db-dev-primary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-primary-07sep2022","active","2022-09-07","postgresql","db.t3.medium","false","true"
"beta-db-dev-secondary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-secondary-07sep2022","active","2022-09-07","postgresql","db.t3.small","true","false"
"charlie-db-prod","mysql","db.t3.small",true,"charlie-db-prod-07sep2022","active","2022-09-07","sql","db.t4g.medium","false","false"

This is a very bad attempt as I'm still learning the ropes of awk. Attempt:

    NR == 1 { print $0, "\"Engine match?\",\"DB Instance Type Match?\""; next } #prints the column 10 and 11 header
    {
        min = $1 #skips the first line
        for (i=2; i<=NF; i++) {
            if ($10 != $11) { #I'm not sure how to call columns here
                #I'm not sure how to put in the result true / false in column 10 and 11
            }
        }
        print $0, min
    }

Any help or idea would be great. I would also appreciate if the breakdown of the awk script would be given so I can study it further for another use case. Thanks!

Upvotes: 0

Views: 51

Answers (1)

Ed Morton
Ed Morton

Reputation: 204477

for (i=2; i<=NF; i++) loops over all fields from 2 to the end of the line. Your code compares $10 to $11 NF-1 times. That in combination with min = $1 #skips the first line tells me you're confusing the NF and NR variables (see the man page for what each means) and also think you need to write a loop to process all the input lines - you don't, awk does that for you. Text-processing code in other languages looks like this pseudo-code:

while ( read line ) {
    ++NR
    NF = 0
    while ( field = next word in line ) {
       fields[++NF] = field
    }
    print line
    for ( i=1; i<=NF; i++ ) {
        print NR, NF, fields[i]
    }
}

but awk provides most of that for you and so the code you actually write in awk would just be:

    print $0
    for ( i=1; i<=NF; i++ ) {
        print NR, NF, $i
    }

Hope that helps with the basic awk understanding.

Now to your question -

To add 2 fields to your data after the 9th field would be:

{ $9 = $9 FS "" FS ""; $0=$0 }

the first row is a header where I would like to add the term "Engine match?" and "Instance Type Match?" which would be columns 10 and 11.

NR==1 { $10="\"Engine match?\""; $11="\"Instance Type Match?\"" }

Then I would like the second row and succeeding ones to check if the value in column 2 and 8 matches. In the 2nd row and print true in column 10 if it does, and false if not.

NR > 1 { $10 = ( $2 == $8 ? "true" : "false" ) }

I would also like to check if the values in column 3 and 9 are matching following the same logic above but print the result in column 11.

NR > 1 { $11 = ( $3 == $9 ? "true" : "false" ) }

All together:

awk '
    BEGIN { FS=OFS="," }
    { $9 = $9 FS "" FS ""; $0=$0 }
    NR==1 {
        $10 = "\"Engine match?\""
        $11 = "\"Instance Type Match?\""
    }
    NR > 1 {
        $10 = ( $2 == $8 ? "true" : "false" )
        $11 = ( $3 == $9 ? "true" : "false" )
    }
    { print }
' file

The only non-obvious parts to the above are:

  1. Assigning a value to or otherwise modifying $9 (or any field) causes awk to reconstruct $0, replacing all FSs with OFSs
  2. Assigning to or otherwise modifying $0 causes awk to re-split $0 into fields.

Look:

$ echo "foo" | awk '{for (i=1; i<=NF; i++) print NF, i, "<" $i ">"}'
1 1 <foo>

$ echo "foo" | awk '{$1=$1 FS "bar"; for (i=1; i<=NF; i++) print NF, i, "<" $i ">"}'
1 1 <foo bar>

Note above that there's still just 1 field in $0 after modifying $1, now "change" $0 to cause field splitting to reoccur:

$ echo "foo" | awk '{$1=$1 FS "bar"; $0=$0; for (i=1; i<=NF; i++) print NF, i, "<" $i ">"}'
2 1 <foo>
2 2 <bar>

and NOW we have an extra field.

So assigning to $9 doesn't add fields to $0, it just makes $9 contain more text. It's assigning to $0 afterwards that turns that new text we added into new fields. For this to work it's important that FS and OFS have the same value so that when assigning to a field (which triggers replacing all FSs with OFSs) we don't change what separates fields so that when we then assign to $0 (which triggers field splitting at each FS) we still have the same separator value in $0 as we started with.

Upvotes: 1

Related Questions