Peaceful_Warrior
Peaceful_Warrior

Reputation: 59

Linux command line to process CSV

I have the following CSV file:

data.csv

Chart #,Ticker,Industry,Last Price,Multiple
2,AFL,Accident & Health Insurance,60.9,0.82
3,UNM,Accident & Health Insurance,32.97,1.52
4,CNO,Accident & Health Insurance,19.33,2.59
2,OMC,Advertising Agencies,71.71,0.7
3,IPG,Advertising Agencies,21.24,2.35
4,ADS,Advertising Agencies,278.18,0.18
2,UPS,Air Delivery & Freight Services,103.8,0.48
3,FDX,Air Delivery & Freight Services,152.11,0.33
4,EXPD,Air Delivery & Freight Services,50.725,0.99
5,CHRW,Air Delivery & Freight Services,72.3,0.69
6,FWRD,Air Delivery & Freight Services,42.86,1.17

I'd like to use Awk or the best linux command line tool to make the date in the file look like this:

output.txt

Accident & Health Insurance
2*0.82,3*1.52,4*2.59

Advertising Agencies
2*0.7,3*2.35,4*0.18

Air Delivery & Freight Services
2*0.48,3*0.33,4*0.99,5*0.69,6*1.17

I'm basically taking all "Chart #" & multiplying them by the multiple and outputing the "Industry" on one line, all charts with multiples comma separated, then a blank space on a third line...then it processes the entire list.

Can someone point me in the right direction on how to do this? Would Awk be the best tool for this task or do I have to create a bash script to process it all?

Upvotes: -1

Views: 412

Answers (2)

Ed Morton
Ed Morton

Reputation: 203617

$ awk -F, -v OFS='\n' -v ORS='\n\n' '
    NR==1 { next }
    (NR>2) && ($3!=prevKey) { print prevKey, prevRec; prevRec="" }
    { prevKey=$3; prevRec=(prevRec==""?"":prevRec",") $1"*"$NF }
    END { print prevKey, prevRec }
' file
Accident & Health Insurance
2*0.82,3*1.52,4*2.59

Advertising Agencies
2*0.7,3*2.35,4*0.18

Air Delivery & Freight Services
2*0.48,3*0.33,4*0.99,5*0.69,6*1.17

The functional differences between the above and @A-Ray's answer are that:

  1. Mine assumes the file is sorted on $3 as shown in your sample input while A-Rays does not.
  2. Mine only stores the output string associated with one $3 value at a time in memory while A-Rays stores all the output strings for all $3 values at once.
  3. Mine prints the output in the order the $3 values occur in the input file while A-Rays prints them in "random" order (the order their indices are stored in the hash table).
  4. Mine prints a blank line between output records as shown in your expected output while A-Rays does not.

Upvotes: 4

bian
bian

Reputation: 1456

awk -F, '{a[$3]=a[$3]?a[$3]","$1"*"$NF:$1"*"$NF}END{for(i in a)print i"\n"a[i]}' filename
Air Delivery & Freight Services
2*0.48,3*0.33,4*0.99,5*0.69,6*1.17
Advertising Agencies
2*0.7,3*2.35,4*0.18
Accident & Health Insurance
2*0.82,3*1.52,4*2.59

Upvotes: 4

Related Questions