Reputation: 45134
I have a CSV file that I'm totaling up two ways: one using Excel and the other using awk. Here are the totals of my first 8 columns in Excel:
1) 2640502474.00
2) 1272849386284.00
3) 36785.00
4)
5) 107.00
6) 239259.00
7) 0.00
8) 7418570893330.00
And here's my awk output:
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$1} END {printf("%01.2f\n", s)}'
2640502474.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$2} END {printf("%01.2f\n", s)}'
1272849386284.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$8} END {printf("%01.2f\n", s)}'
7411306364347.00
Notice how columns 1 and 2 match exactly, but column 8 is off by many millions. I'm assuming Excel's total is the correct one, so why is awk handling this file differently?
Upvotes: 3
Views: 273
Reputation: 8344
You likely have a comma-formatted number contained in quotes. Excel will correctly treat that number as a single field. Your field-separator regex in awk won't - a comma inside a quoted number is a valid separator according to that regex. It is very hard (and mostly futile) to try to handle the optional quoting and escaping that CSV allows with a regex alone.
Compare the following to see what is likely going on:
$ echo '"1","10","15","1,000","14"' | awk -F "\"*,\"*" '{print $4}'
1
$ echo '"1","10","15","1,000","14"' | awk -F "\",\"" '{print $4}'
1,000
Note that the second regex above still leaves a stray " at the start of the first field and the end of the last field, and it only works at all if every field is consistently quoted - it is for illustration purposes only.
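If GNU awk (gawk) is available, a sturdier sketch is to describe what a field looks like rather than what separates fields, using gawk's FPAT variable. This assumes the simple case of no escaped quotes or embedded newlines inside fields, and uses the /home/jason/import.csv path from the question; the gsub() strips the surrounding quotes and the thousands-separator commas from column 8 before summing:
$ echo '"1","10","15","1,000","14"' | gawk -v FPAT='([^,]*)|("[^"]*")' '{print $4}'
"1,000"
$ gawk -v FPAT='([^,]*)|("[^"]*")' '{gsub(/[",]/, "", $8); s += $8} END {printf("%01.2f\n", s)}' /home/jason/import.csv
Because FPAT defines the shape of a field instead of the separator, the quoted "1,000" survives as one field. This is still only a sketch: it will not cope with escaped quotes or newlines inside quoted fields, so for genuinely messy data a real CSV parser is the safer choice.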
Upvotes: 3