Reputation: 149
I would like to compare two text strings if they are correct and matching provided that they are in 2 different columns in a csv file. I have the values
"DB Instance Identifier","Engine","Instance Type","MultiAZ","Reservation ID","State","Start Time","Reserved Engine","Reserved DB Instance Type"
"alpha-db-dev","postgres","db.t4g.micro",false,"alpha-db-dev-26aug2022","active","2022-08-26","postgresql","db.t4g.micro"
"alpha-db-prod","postgres","db.t4g.small",true,"alpha-db-prod-26aug2022","active","2022-08-26","postgresql","db.t4g.small"
"beta-db-dev-primary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-primary-07sep2022","active","2022-09-07","postgresql","db.t3.medium"
"beta-db-dev-secondary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-secondary-07sep2022","active","2022-09-07","postgresql","db.t3.small"
"charlie-db-prod","mysql","db.t3.small",true,"charlie-db-prod-07sep2022","active","2022-09-07","sql","db.t4g.medium"
in the example, the first row is a header where I would like to add the term "Engine match?" and "Instance Type Match?" which would be columns 10 and 11.
Then I would like the second row and succeeding ones to check if the value in column 2 and 8 matches. In the 2nd row and print true
in column 10 if it does, and false
if not.
I would also like to check if the values in column 3 and 9 are matching following the same logic above but print the result in column 11.
Desired Output:
"DB Instance Identifier","Engine","Instance Type","MultiAZ","Reservation ID","State","Start Time","Reserved Engine","Reserved DB Instance Class","Engine match?","Instance Type Match?"
"alpha-db-dev","postgres","db.t4g.micro",false,"alpha-db-dev-26aug2022","active","2022-08-26","postgresql","db.t4g.micro","true","true"
"alpha-db-prod","postgres","db.t4g.small",true,"alpha-db-prod-26aug2022","active","2022-08-26","postgresql","db.t4g.small","true","true"
"beta-db-dev-primary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-primary-07sep2022","active","2022-09-07","postgresql","db.t3.medium","false","true"
"beta-db-dev-secondary","aurora-postgresql","db.t3.medium",false,"beta-db-dev-secondary-07sep2022","active","2022-09-07","postgresql","db.t3.small","true","false"
"charlie-db-prod","mysql","db.t3.small",true,"charlie-db-prod-07sep2022","active","2022-09-07","sql","db.t4g.medium","false","false"
This is a very bad attempt as I'm still learning the ropes of awk. Attempt:
NR == 1 { print $0, "\"Engine match?\",\"DB Instance Type Match?\""; next } #prints the column 10 and 11 header
{
min = $1 #skips the first line
for (i=2; i<=NF; i++) {
if ($10 != $11) { #I'm not sure how to call columns here
#I'm not sure how to put in the result true / false in column 10 and 11
}
}
print $0, min
}
Any help or idea would be great. I would also appreciate if the breakdown of the awk script would be given so I can study it further for another use case. Thanks!
Upvotes: 0
Views: 51
Reputation: 204477
for (i=2; i<=NF; i++)
loops over all fields from 2 to the end of the line. Your code compares $10
to $11
NF-1 times. That in combination with min = $1 #skips the first line
tells me you're confusing the NF
and NR
variables (see the man page for what each means) and also think you need to write a loop to process all the input lines - you don't, awk does that for you. Text-processing code in other languages looks like this pseudo-code:
while ( read line ) {
++NR
NF = 0
while ( field = next word in line ) {
fields[++NF] = field
}
print line
for ( i=1; i<=NF; i++ ) {
print NR, NF, fields[i]
}
}
but awk provides most of that for you and so the code you actually write in awk would just be:
print $0
for ( i=1; i<=NF; i++ ) {
print NR, NF, $i
}
Hope that helps with the basic awk understanding.
Now to your question -
To add 2 fields to your data after the 9th field would be:
{ $9 = $9 FS "" FS ""; $0=$0 }
the first row is a header where I would like to add the term "Engine match?" and "Instance Type Match?" which would be columns 10 and 11.
NR==1 { $10="\"Engine match?\""; $11="\"Instance Type Match?\"" }
Then I would like the second row and succeeding ones to check if the value in column 2 and 8 matches. In the 2nd row and print true in column 10 if it does, and false if not.
NR > 1 { $10 = ( $2 == $8 ? "true" : "false" ) }
I would also like to check if the values in column 3 and 9 are matching following the same logic above but print the result in column 11.
NR > 1 { $11 = ( $3 == $9 ? "true" : "false" ) }
All together:
awk '
BEGIN { FS=OFS="," }
{ $9 = $9 FS "" FS ""; $0=$0 }
NR==1 {
$10 = "\"Engine match?\""
$11 = "\"Instance Type Match?\""
}
NR > 1 {
$10 = ( $2 == $8 ? "true" : "false" )
$11 = ( $3 == $9 ? "true" : "false" )
}
{ print }
' file
The only non-obvious parts to the above are:
Look:
$ echo "foo" | awk '{for (i=1; i<=NF; i++) print NF, i, "<" $i ">"}'
1 1 <foo>
$ echo "foo" | awk '{$1=$1 FS "bar"; for (i=1; i<=NF; i++) print NF, i, "<" $i ">"}'
1 1 <foo bar>
Note above that there's still just 1 field in $0 after modifying $1, now "change" $0 to cause field splitting to reoccur:
$ echo "foo" | awk '{$1=$1 FS "bar"; $0=$0; for (i=1; i<=NF; i++) print NF, i, "<" $i ">"}'
2 1 <foo>
2 2 <bar>
and NOW we have an extra field.
So assigning to $9 doesn't add fields to $0, it just makes $9 contain more text. It's assigning to $0 afterwards that turns that new text we added into new fields. For this to work it's important that FS and OFS have the same value so that when assigning to a field (which triggers replacing all FSs with OFSs) we don't change what separates fields so that when we then assign to $0 (which triggers field splitting at each FS) we still have the same separator value in $0 as we started with.
Upvotes: 1