Reputation: 45134
I have a CSV file that I'm totaling up two ways: one using Excel and the other using awk. Here are the totals of my first 8 columns in Excel:
1) 2640502474.00
2) 1272849386284.00
3) 36785.00
4)
5) 107.00
6) 239259.00
7) 0.00
8) 7418570893330.00
And here's my awk output:
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$1} END {printf("%01.2f\n", s)}'
2640502474.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$2} END {printf("%01.2f\n", s)}'
1272849386284.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$8} END {printf("%01.2f\n", s)}'
7411306364347.00
Notice how columns 1 and 2 match exactly, but column 8 is off by many millions. I'm assuming Excel's total is the correct one, so why is awk handling this file differently?
Upvotes: 3
Views: 273
Reputation: 8344
You likely have a comma-formatted number contained in quotes. Excel will correctly treat that number as a single field. Your field-separator regex in awk won't - a comma inside a quoted number is a valid separator according to that regex. It is very hard (and mostly futile) to try to handle the optional quoting and escaping that CSV allows with a regex alone.
Compare the following to see what is likely going on:
$ echo '"1","10","15","1,000","14"' | awk -F "\"*,\"*" '{print $4}'
1
$ echo '"1","10","15","1,000","14"' | awk -F "\",\"" '{print $4}'
1,000
Note that the second regex above still leaves a stray " at the start of the first field and the end of the last field, and it only works at all if every field is consistently quoted - it is for illustration purposes only.
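If GNU awk (gawk) is available, a sturdier sketch is to describe what a field looks like rather than what separates fields, using gawk's FPAT variable. This assumes the simple case of no escaped quotes or embedded newlines inside fields, and uses the /home/jason/import.csv path from the question; the gsub() strips the surrounding quotes and the thousands-separator commas from column 8 before summing:
$ echo '"1","10","15","1,000","14"' | gawk -v FPAT='([^,]*)|("[^"]*")' '{print $4}'
"1,000"
$ gawk -v FPAT='([^,]*)|("[^"]*")' '{gsub(/[",]/, "", $8); s += $8} END {printf("%01.2f\n", s)}' /home/jason/import.csv
Because FPAT defines the shape of a field instead of the separator, the quoted "1,000" survives as one field. This is still only a sketch: it will not cope with escaped quotes or newlines inside quoted fields, so for genuinely messy data a real CSV parser is the safer choice.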
Upvotes: 3