Reputation: 1299
I've seen different versions of this question on Stack Overflow but have not come across one that addresses this particular use case.
The goal: find duplicate rows based on the entire row (not just one column), ignoring the value in the final column. Eliminate all but one of each set of duplicates, but first sum the values in the final column of every duplicate and write that sum into the final column of the remaining row. I would like to do this in Bash.
I have a table of each page in a website and the number of views it has received, along with some other metadata. However, some rows in the table represent the same page, just with different view counts. Those views need to be summed to show all-time views for each page.
Original file:
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",100
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",15
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",10
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",20
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",30
Desired file:
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
What I would like to do here is share each script iteration I've tried and break down what worked and what didn't, but this is so over my head that I'm struggling to do even that. My process has been to take portions of answers to similar Stack Overflow questions (all of which have been in awk, which makes sense to me) and change which column is compared. But because some answers compare only one column, the results of my modifications are inconsistent and strange, and the scripts are complex enough that I'm struggling to understand why.
Can anyone explain how I might go about finding an answer, or provide an example that points me in the right direction? Thank you.
Upvotes: 0
Views: 409
Reputation: 67507
Another awk:
$ awk -F, -v OFS=, 'NR==1 {print; next}
{v=$NF; NF--; a[$0]+=v}
END {for(k in a) print k,a[k] | "sort"}' file
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
Explanation
Print the header line and move on to the next record. For each data row, save the value of the last field, then decrement NF so that the remainder of the record (the rebuilt $0) becomes the key; add the saved value to an accumulator indexed by that key, which sums the views of equivalent rows. At the end, print each key with its accumulated total, piping the output through sort.
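Note that rebuilding $0 when NF is decremented is well-defined in GNU awk but left undefined by POSIX, so other awks may not honor it. As a minimal sketch of the same idea for those awks, you could build the key with substr() instead (file is the same input placeholder as above):
$ awk -F, -v OFS=, 'NR==1 {print; next}
     {v=$NF;                                      # save the views value
      k=substr($0, 1, length($0)-length($NF)-1);  # strip ",<views>" to form the key
      a[k]+=v}                                    # sum views per key
     END {for(k in a) print k,a[k] | "sort"}' file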
Upvotes: 2
Reputation: 203684
This will work whether or not any of the quoted fields can contain a , (e.g. if any of the fields with the "Page Title 1" placeholder text was actually something like "I, Robot - Page 1"):
$ awk '
BEGIN { FS=OFS="," }
NR==1 { print; next }
{ num=$NF; sub(/,[^,]*$/,""); sum[$0]+=num }
END { for (key in sum) print key, sum[key] }
' file
url,title,tag,version,guide,views
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
Upvotes: 4
Reputation: 52439
One way using GNU datamash, grouping on the first five fields and summing the sixth (-t, sets the comma delimiter, -s sorts the input before grouping, and --header-in skips the header line):
$ echo "url,title,tag,version,guide,views" && datamash --header-in -st, -g1,2,3,4,5 sum 6 < input.txt
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
Or with awk:
$ awk -F, 'NR==1 { print; next }
{ groups[$1 "," $2 "," $3 "," $4 "," $5] += $6 }
END { PROCINFO["sorted_in"] = "@ind_str_asc" # Sorted output when using GNU awk
for (g in groups) print g "," groups[g]
}' input.txt
url,title,tag,version,guide,views
"https://website.com/1-1/section/product/page-title","Page Title 1",tag-1,"1-1","guide-1",115
"https://website.com/2-2/section/product/page-title","Page Title 2",tag-2,"2-2","guide-2",5
"https://website.com/3-3/section/product/page-title","Page Title 3",tag-3,"3-3","guide-3",60
"https://website.com/4-4/section/product/page-title","Page Title 4",tag-4,"4-4","guide-4",7
Upvotes: 2