ComputerFellow
ComputerFellow

Reputation: 12108

How to merge lines and add column values?

So I have a laaaaaaaarge file like this:

Item|Cost1|Cost2
Pizza|50|25
Sugar|100|100
Spices|100|200
Pizza|100|25
Sugar|200|100
Pizza|50|100

I want to add all Cost1s and Cost2s for a particular item and produce a merged output.

I've written a python code to do this,

item_dict = {}
for line in file:
    fields = line.split('|')
    item = fields[0]
    cost1 = fields[1]
    cost2 = fields[2]
    if item_dict.has_key(item):
        item_dict[item][0] += int(cost1)
        item_dict[item][1] += int(cost2)
    else:
        item_dict[item] = [int(cost1),int(cost2)]

for key, val in item_dict.items():
    print key,"|".join(val)

Is there anyway to do this very efficiently and quickly in awk or using any other wizardry?

Or can I make my python more elegant and faster?

Expected Output

Pizza|200|150
Sugar|300|200
Spices|100|200

Upvotes: 2

Views: 280

Answers (3)

Ed Morton
Ed Morton

Reputation: 203209

awk '
    BEGIN { FS=OFS="|" }
    NR==1 { expectedNF = NF; next }
    NF != expectedNF { print "Fix your #%@#&! data, idiot!"; exit 1 }'
    {
        items[$1]
        for (c=2;c<=NF;c++)
            cost[$1,c] += $c
    } 
    END {
        for (i in items) {
            printf "%s", i
            for (c=2;c<=NF;c++)
                printf "%s%s", OFS, cost[i,c]
            print ""
        }
    }
' file

Feel free to compress it onto 1 or 2 lines as you see fit.

Upvotes: 4

Chris Seymour
Chris Seymour

Reputation: 85775

In practice I would have done what fedorqui did. For completeness however, this python script should be faster than your original:

#!/usr/bin/env python

import fileinput

item_dict = {}

for line in fileinput.input():
    if not fileinput.isfirstline():
        fields = line.strip().split('|')
        item = fields[0]
        cost1 = int(fields[1])
        cost2 = int(fields[2])
        try:
            item_dict[item][0] += cost1
            item_dict[item][1] += cost2
        except KeyError:
            item_dict[item] = [cost1, cost2]

for key, val in item_dict.items():
    print "%s|%s|%s" % (key,val[0],val[1])

Save the script to a file such as sumcols and make it executable chmod +x sumcols and run like:

$ ./sumcols file
Spices|100|200
Sugar|300|200
Pizza|200|150

Upvotes: 1

fedorqui
fedorqui

Reputation: 289505

Something like this...

$ awk 'BEGIN{OFS=FS="|"}
  NR>1 {cost1[$1]+=$2; cost2[$1]+=$3} 
  END{ for (i in cost1) print i, cost1[i], cost2[i]}' file
Sugar|300|200
Spices|100|200
Pizza|200|150

Explanation

  • BEGIN{OFS=FS="|"} sets the (input & output) field separator to be |.
  • NR>1 means that we are going to do some actions for line number bigger than 1. This way we skip the header.
  • cost1 and cost2 are arrays whose index is the first field and its value is the sum till that point.
  • END {} is something we do after reading the whole file. It consists in looping through the array and printing the values.

Upvotes: 10

Related Questions