VNA
VNA

Reputation: 625

awk running total count and sum

Would like to know how to calculate 80%-20% rule contribution of vendors on daily basis.

Input.csv

Date,Region,Vendor,Amount
5-Apr-15,east,cc,50
5-Apr-15,east,dd,15
5-Apr-15,south,bb,15
5-Apr-15,south,aa,10
7-Apr-15,east,cc,123
7-Apr-15,south,bb,88
7-Apr-15,south,aa,40
7-Apr-15,west,ss,30
7-Apr-15,west,rr,20

In the above input, based on $1 field need to populate Running Sum of Amount then calculate percentage of Running Sum of Amount for the day

Date,Region,Vendor,Amount,RunningSum,%RunningSum
5-Apr-15,east,cc,50,50,56%   (RunningSum=50 , %RunningSum=50/90(Total Amount for the day) 
5-Apr-15,east,dd,15,65,72%   (RunningSum=50+15, %RunningSum=65/90)
5-Apr-15,south,bb,15,80,89%  (RunningSum=65+15, %RunningSum=80/90)
5-Apr-15,south,aa,10,90,100% (RunningSum=80+10, %RunningSum=90/90)

Once it is derived 80% or first hit of 80%above need to consider as 80% contribution remaining line items need to be consider as 20% contribution.

Expected Output:

Date,Countof80%Vendor, SumOf80%Vendor, Countof20%Vendor, SumOf20%Vendor
5-Apr-15,3,80,1,10
7-Apr-15,3,251,2,50

Any suggestions ...

Upvotes: 0

Views: 1486

Answers (3)

amaurea
amaurea

Reputation: 5067

This is relatively complicated to do in awk, but here goes.

awk -F , '{vals[$1,++nums[$1]]=$4}END{
    for(d in nums){
        tot=0
        for(i=1;i<=nums[d];i++)tot+=vals[d,i]
        n=0;s=0
        for(i=1;i<=nums[d];i++){
            n++;s+=vals[d,i]
            if(s>=tot*0.8){s80=s;n80=n;s=0;n=0}
        }
        printf("%s,%d,%d,%d,%d\n",d,n80,s80,n,s)
    }
}' vendors.txt

This creates independent lists of values for each date, and after collecting all the data loops trough those lists and figures out where the 80% point is.

This assumes that the file contains only data (no "Date,Region,Vender.." header). If you want a one-line version for copy and paste purposes, here it is:

awk -F , '{vals[$1,++nums[$1]]=$4}END{for(d in nums){tot=0;for(i=1;i<=nums[d];i++)tot+=vals[d,i];n=0;s=0;for(i=1;i<=nums[d];i++){n++;s+=vals[d,i];if(s>=tot*0.8){s80=s;n80=n;s=0;n=0}};printf("%s,%d,%d,%d,%d\n",d,n80,s80,n,s)}}' vendors.txt

Edit: If you want counts separately for each region, you just need to change the $1 part into $1","$2, to concatenate date and region:

awk -F , '{a=$1","$2;vals[a,++nums[a]]=$4}END{for(d in nums){tot=0;for(i=1;i<=nums[d];i++)tot+=vals[d,i];n=0;s=0;for(i=1;i<=nums[d];i++){n++;s+=vals[d,i];if(s>=tot*0.8){s80=s;n80=n;s=0;n=0}};printf("%s,%d,%d,%d,%d\n",d,n80,s80,n,s)}}' vendors.txt

To be honest, though, I'm a bit unclear about exactly what you're trying to measure with this. For example, the number of vendors that make up 80% of the sales is not uniquely defined - it depends on the order you sort them in. If you sort all the small sales first, you will get a larger number than if you sort the large sales first. If you don't sort them in any particular order, you will get something intermediate. If what you want is how many of the biggest vendors make up 80% of the sales, then you will need to sort the array before counting.

Upvotes: 2

Ed Morton
Ed Morton

Reputation: 203368

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR { tot[$1] += $NF; next }
FNR==1 { print $0, "RunningSum", "%RunningSum"; next }
{ sum[$1]+=$NF; print $0, sum[$1], int(0.5+sum[$1]*100/tot[$1])"%" }

$ awk -f tst.awk file file
Date,Region,Vendor,Amount,RunningSum,%RunningSum
5-Apr-15,east,cc,50,50,56%
5-Apr-15,east,dd,15,65,72%
5-Apr-15,south,bb,15,80,89%
5-Apr-15,south,aa,10,90,100%
7-Apr-15,east,cc,123,123,41%
7-Apr-15,south,bb,88,211,70%
7-Apr-15,south,aa,40,251,83%
7-Apr-15,west,ss,30,281,93%
7-Apr-15,west,rr,20,301,100%

I've provide the answer above for the first part of your question but I have no idea how the part starting from Once it is derived 80% relates to it. Is that something else you want output to a separate file given the same input? Appended to the above output? A different script? Something else?

Upvotes: 3

n0741337
n0741337

Reputation: 2514

Here's a awk script to do that as the dates change:

#!/usr/bin/awk -f

BEGIN {
    FS=OFS=","
    print "Date" OFS "Countof80%Vendor" OFS "SumOf80%Vendor" OFS "Countof20%Vendor" OFS "SumOf20%Vendor"
}

FNR==1 { next } # skip the header

last_1 != $1 && FNR > 2 {
    output( a, cnt, tot, last_1 )
    cnt = tot = 0
    delete( a )
}

{ tot += $4; a[++cnt] = tot; last_1 = $1 }

END { output( a, cnt, tot, last_1 ) }


func output( a, cnt, tot, last_date,                 perc80, i, runsum ) {
    perc80 = .8 * tot
    for(i=1; i<=cnt; i++) {
        runsum = a[i]
        if(runsum >= perc80) {
            print last_date OFS i OFS runsum OFS (cnt-i) OFS (tot-runsum)
            break
        }
    }
}

Here's the break-down:

  • BEGIN - set FS and OFS to ,, then print the header
  • FNR==1 - skip the header row
  • last_1 != $1 && FNR > 2 - whenever the $1 field changes values, print out the running totals by calling the function output. Afterwards, clear the cnt, tot variables and delete the array a.
  • For every non-header row, sum $4 in tot. Then set a[++cnt] = tot to capture both the running total at an index of the number of elements read so far. Lastly, reset the last_1 variable to $1 after all the other processing.
  • END - catch the final date grouping and print it with output

As for the function output:

  • perc80 - calc the 80% mark of the tot value.
  • Walk the a array elements using i and cnt.
  • store the value of a[i] into runsum for clarity/to reduce a[i] calls
  • if(runsum >= perc80) print an output line then break.

Running this script gives the following output:

Date,Countof80%Vendor,SumOf80%Vendor,Countof20%Vendor,SumOf20%Vendor
5-Apr-15,3,80,1,10
7-Apr-15,3,251,2,50

Upvotes: 3

Related Questions