Reputation: 625
Would like to know how to calculate 80%-20% rule contribution of vendors on daily basis.
Input.csv
Date,Region,Vendor,Amount
5-Apr-15,east,cc,50
5-Apr-15,east,dd,15
5-Apr-15,south,bb,15
5-Apr-15,south,aa,10
7-Apr-15,east,cc,123
7-Apr-15,south,bb,88
7-Apr-15,south,aa,40
7-Apr-15,west,ss,30
7-Apr-15,west,rr,20
In the above input, based on $1 field need to populate Running Sum of Amount then calculate percentage of Running Sum of Amount for the day
Date,Region,Vendor,Amount,RunningSum,%RunningSum
5-Apr-15,east,cc,50,50,56% (RunningSum=50 , %RunningSum=50/90(Total Amount for the day)
5-Apr-15,east,dd,15,65,72% (RunningSum=50+15, %RunningSum=65/90)
5-Apr-15,south,bb,15,80,89% (RunningSum=65+15, %RunningSum=80/90)
5-Apr-15,south,aa,10,90,100% (RunningSum=80+10, %RunningSum=90/90)
Once it is derived 80% or first hit of 80%above need to consider as 80% contribution remaining line items need to be consider as 20% contribution.
Expected Output:
Date,Countof80%Vendor, SumOf80%Vendor, Countof20%Vendor, SumOf20%Vendor
5-Apr-15,3,80,1,10
7-Apr-15,3,251,2,50
Any suggestions ...
Upvotes: 0
Views: 1486
Reputation: 5067
This is relatively complicated to do in awk
, but here goes.
awk -F , '{vals[$1,++nums[$1]]=$4}END{
for(d in nums){
tot=0
for(i=1;i<=nums[d];i++)tot+=vals[d,i]
n=0;s=0
for(i=1;i<=nums[d];i++){
n++;s+=vals[d,i]
if(s>=tot*0.8){s80=s;n80=n;s=0;n=0}
}
printf("%s,%d,%d,%d,%d\n",d,n80,s80,n,s)
}
}' vendors.txt
This creates independent lists of values for each date, and after collecting all the data loops trough those lists and figures out where the 80% point is.
This assumes that the file contains only data (no "Date,Region,Vender.." header). If you want a one-line version for copy and paste purposes, here it is:
awk -F , '{vals[$1,++nums[$1]]=$4}END{for(d in nums){tot=0;for(i=1;i<=nums[d];i++)tot+=vals[d,i];n=0;s=0;for(i=1;i<=nums[d];i++){n++;s+=vals[d,i];if(s>=tot*0.8){s80=s;n80=n;s=0;n=0}};printf("%s,%d,%d,%d,%d\n",d,n80,s80,n,s)}}' vendors.txt
Edit: If you want counts separately for each region, you just need to change the $1 part into $1","$2, to concatenate date and region:
awk -F , '{a=$1","$2;vals[a,++nums[a]]=$4}END{for(d in nums){tot=0;for(i=1;i<=nums[d];i++)tot+=vals[d,i];n=0;s=0;for(i=1;i<=nums[d];i++){n++;s+=vals[d,i];if(s>=tot*0.8){s80=s;n80=n;s=0;n=0}};printf("%s,%d,%d,%d,%d\n",d,n80,s80,n,s)}}' vendors.txt
To be honest, though, I'm a bit unclear about exactly what you're trying to measure with this. For example, the number of vendors that make up 80% of the sales is not uniquely defined - it depends on the order you sort them in. If you sort all the small sales first, you will get a larger number than if you sort the large sales first. If you don't sort them in any particular order, you will get something intermediate. If what you want is how many of the biggest vendors make up 80% of the sales, then you will need to sort the array before counting.
Upvotes: 2
Reputation: 203368
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR { tot[$1] += $NF; next }
FNR==1 { print $0, "RunningSum", "%RunningSum"; next }
{ sum[$1]+=$NF; print $0, sum[$1], int(0.5+sum[$1]*100/tot[$1])"%" }
$ awk -f tst.awk file file
Date,Region,Vendor,Amount,RunningSum,%RunningSum
5-Apr-15,east,cc,50,50,56%
5-Apr-15,east,dd,15,65,72%
5-Apr-15,south,bb,15,80,89%
5-Apr-15,south,aa,10,90,100%
7-Apr-15,east,cc,123,123,41%
7-Apr-15,south,bb,88,211,70%
7-Apr-15,south,aa,40,251,83%
7-Apr-15,west,ss,30,281,93%
7-Apr-15,west,rr,20,301,100%
I've provide the answer above for the first part of your question but I have no idea how the part starting from Once it is derived 80%
relates to it. Is that something else you want output to a separate file given the same input? Appended to the above output? A different script? Something else?
Upvotes: 3
Reputation: 2514
Here's a awk script to do that as the dates change:
#!/usr/bin/awk -f
BEGIN {
FS=OFS=","
print "Date" OFS "Countof80%Vendor" OFS "SumOf80%Vendor" OFS "Countof20%Vendor" OFS "SumOf20%Vendor"
}
FNR==1 { next } # skip the header
last_1 != $1 && FNR > 2 {
output( a, cnt, tot, last_1 )
cnt = tot = 0
delete( a )
}
{ tot += $4; a[++cnt] = tot; last_1 = $1 }
END { output( a, cnt, tot, last_1 ) }
func output( a, cnt, tot, last_date, perc80, i, runsum ) {
perc80 = .8 * tot
for(i=1; i<=cnt; i++) {
runsum = a[i]
if(runsum >= perc80) {
print last_date OFS i OFS runsum OFS (cnt-i) OFS (tot-runsum)
break
}
}
}
Here's the break-down:
BEGIN
- set FS
and OFS
to ,
, then print the headerFNR==1
- skip the header rowlast_1 != $1 && FNR > 2
- whenever the $1
field changes values, print out the running totals by calling the function output
. Afterwards, clear the cnt
, tot
variables and delete the array a
.$4
in tot
. Then set a[++cnt] = tot
to capture both the running total at an index of the number of elements read so far. Lastly, reset the last_1
variable to $1
after all the other processing.END
- catch the final date grouping and print it with output
As for the function output
:
perc80
- calc the 80% mark of the tot
value.a
array elements using i
and cnt
.a[i]
into runsum
for clarity/to reduce a[i]
callsif(runsum >= perc80)
print an output line then break.Running this script gives the following output:
Date,Countof80%Vendor,SumOf80%Vendor,Countof20%Vendor,SumOf20%Vendor
5-Apr-15,3,80,1,10
7-Apr-15,3,251,2,50
Upvotes: 3