Fernando Bonetti
Fernando Bonetti

Reputation: 27

Use AWK to print FILENAME to CSV

I have a little script to compare some columns inside a bunch of CSV files. It's working fine, but there are some things that are bugging me.

Here is the code:

FILES=./*
for f in $FILES

do
    cat -v $f | sed "s/\^A/,/g" > op_tmp.csv
    awk -F, -vOFS=, 'NR == 1{next} $9=="T"{t[$8]+=$7;n[$8]} $9=="A"{a[$8]+=$7;n[$8]} $9=="C"{c[$8]+=$7;n[$8]} $9=="R"{r[$8]+=$7;n[$8]} $9=="P"{p[$8]+=$7;n[$8]} END{ for (i in n){print i "|" "A" "|" a[i]; print i "|" "C" "|" c[i]; print i "|" "R" "|" r[i]; print i "|" "P" "|" p[i]; print i "|" "T" "|" t[i] "|" (t[i]==a[i]+c[i]+r[i]+p[i] ? "ERROR" : "MATCHED")} }' op_tmp.csv >> output.csv
    rm op_tmp.csv
done

Just to explain: I get all files on the directory, then i use CAT to replace the divisor ^A for a Pipe |. Then i use the awk onliner to compare the columns i need and print the result to a output.csv.

But now i want to print the filename before every loop. I tried using the cat sed and awk in the same line and printing the $FILENAME, but it doesn't work:

cat -v $f | sed "s/\^A/,/g" | awk -F, -vOFS=, 'NR == 1{next} $9=="T"{t[$8]+=$7;n[$8]} $9=="A"{a[$8]+=$7;n[$8]} $9=="C"{c[$8]+=$7;n[$8]} $9=="R"{r[$8]+=$7;n[$8]} $9=="P"{p[$8]+=$7;n[$8]} END{ for (i in n){print i "|" "A" "|" a[i]; print i "|" "C" "|" c[i]; print i "|" "R" "|" r[i]; print i "|" "P" "|" p[i]; print i "|" "T" "|" t[i] "|" (t[i]==a[i]+c[i]+r[i]+p[i] ? "ERROR" : "MATCHED")} }' > output.csv

Can anyone help?

Upvotes: 1

Views: 743

Answers (2)

glenn jackman
glenn jackman

Reputation: 247042

A rewrite:

for f in ./*; do
    awk -F '\x01' -v OFS="|" '
        BEGIN { 
            letter[1]="A"; letter[2]="C"; letter[3]="R"; letter[4]="P"; letter[5]="T" 
            letters["A"] = letters["C"] = letters["R"] = letters["P"] = letters["T"] = 1
        }
        NR == 1 {next} 
        $9 in letters {
            count[$9,$8] += $7
            seen[$8]
        }
        END { 
            print FILENAME
            for (i in seen) {
                sum = 0
                for (j=1; j<=4; j++) {
                    print i, letter[j], count[letter[j],i]
                    sum += count[letter[j],i]
                }
                print i, "T", count["T",i], (count["T",i] == sum ? "ERROR" : "MATCHED")
            } 
        }
    ' "$f"
done > output.csv

Notes:

  • your method of iterating over files will break as soon as you have a filename with a space in it
  • try to reduce duplication as much as possible.
  • newlines are free, use them to improve readability
  • improve your variable names i, n, etc -- here "letter" and "letters" could use improvement to hold some meaning about those symbols.
  • awk has a FILENAME variable (here's the actual answer to your question)
  • awk understands \x01 to be a Ctrl-A -- I assume that's the field separator in your input files
  • define an Output Field Separator that you'll actually use

If you have GNU awk (version ???) you can use the ENDFILE block and do away with the shell for loop altogether:

gawk -F '\x01' -v OFS="|" '
    BEGIN {...}
    FNR == 1 {next}
    $9 in letters {...}
    ENDFILE {
        print FILENAME
        for ...
        # clean up the counters for the next file
        delete count
        delete seen
    }
' ./* > output.csv

Upvotes: 1

karakfa
karakfa

Reputation: 67527

You can rewrite the whole script better, but assuming it does what you want for now just add

echo $f >> output.csv

before awk call.

If you want to add filename in every awk output line, you have to pass it as an argument, i.e.

 awk ... -v fname="$f" '{...; print fname... etc

Upvotes: 1

Related Questions