Reputation: 1086
I have a file with the following structure:
1486113768 3656
1486113768 6280
1486113769 530912
1486113769 5629824
1486113770 5122176
1486113772 3565920
1486113772 530912
1486113773 9229920
1486113774 4020960
1486113774 4547928
My goal is to get rid of the duplicate values in the first column, sum the corresponding values in the second column, and replace each group of duplicate rows with a single row holding the new sum: the expected output, for the input above, would be:
1486113768 9936 # 3656 + 6280
1486113769 6160736 # 530912 + 5629824
1486113770 5122176 # ...
1486113772 4096832
1486113773 9229920
1486113774 8568888
I know cut and uniq: so far I have managed to find the duplicate values in the first column with:
cut -d " " -f 1 file.log | uniq -d
1486113768
1486113769
1486113772
1486113774
Is there a "awk way" to achieve my goal? I know it is very powerful and terse tool: I used it earlier with
awk '{print $2 " " $3 >> $1".log"}' log.txt
to scan all rows in log.txt and create a .log file with $1 as name, and filling it with $2 and $3 values, all in one bash line (to hell with read
loop!); is there a way to find first column duplicates, sum its second column values and rewrite the row removing the duplicates and printing the resulting sum of second column?
Upvotes: 7
Views: 7686
Reputation: 203684
$ awk '$1!=p{ if (NR>1) print p, s; p=$1; s=0} {s+=$2} END{print p, s}' file
1486113768 9936
1486113769 6160736
1486113770 5122176
1486113772 4096832
1486113773 9229920
1486113774 8568888
The above uses almost no memory (just one string and one integer variable) and will print the output in the same order it appeared in your input.
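Note that this relies on duplicate timestamps being adjacent, as they are in your sample. A minimal sketch (not part of the original answer) for input where the first column might not already be grouped is to sort numerically on that field first:
$ sort -k1,1n file | awk '$1!=p{ if (NR>1) print p, s; p=$1; s=0} {s+=$2} END{print p, s}'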
I highly recommend you read the book Effective Awk Programming, 5th Edition, by Arnold Robbins if you're going to be using awk, both so you can learn how to write your own scripts and, while you're learning, so you can understand other people's scripts well enough to tell the right approaches from the wrong ones when two scripts both produce the expected output for some specific sample input.
Upvotes: 7
Reputation: 8711
Using Perl
$ cat elmazzun.log
1486113768 3656
1486113768 6280
1486113769 530912
1486113769 5629824
1486113770 5122176
1486113772 3565920
1486113772 530912
1486113773 9229920
1486113774 4020960
1486113774 4547928
$ perl -lane ' $kv{$F[0]}+=$F[1];END { print "$_ $kv{$_}" for (sort keys %kv)}' elmazzun.log
1486113768 9936
1486113769 6160736
1486113770 5122176
1486113772 4096832
1486113773 9229920
1486113774 8568888
$
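As a side note, sort keys %kv above sorts the timestamps as strings, which happens to match numeric order here only because they are all the same width; a variant (my sketch, not from the original answer) that sorts the keys numerically would be:
$ perl -lane '$kv{$F[0]}+=$F[1]; END { print "$_ $kv{$_}" for sort { $a <=> $b } keys %kv }' elmazzun.log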
Upvotes: 0
Reputation: 11
Say you have the top-ten lines from many log files concatenated into one file (and sorted with 'sort'), with results like these:
2142 /pathtofile1/00.jpg
2173 /pathtofile1/00.jpg
2100 /pathtofile1/00.jpg
2127 /pathtofile1/00.jpg
you can also swap the roles of the columns, summing the first column grouped by the second:
$ awk '{ seen[$2] += $1 } END { for (i in seen) print i, seen[i] }' top10s.txt | sort -k 2 -rn
and you'll get this total:
/pathtofile1/00.jpg 8542
Upvotes: 0
Reputation: 85693
Use awk as below:
awk '{ seen[$1] += $2 } END { for (i in seen) print i, seen[i] }' file1
1486113768 9936
1486113769 6160736
1486113770 5122176
1486113772 4096832
1486113773 9229920
1486113774 8568888
{seen[$1]+=$2} builds an associative array (hash map) with $1 as the index, accumulating the $2 values so that each unique $1 in the file ends up with its sum.
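One caveat (not mentioned in the original answer): for (i in seen) iterates in an unspecified order, so the output is not guaranteed to be sorted; a simple sketch to restore numeric order is to pipe the result through sort:
awk '{ seen[$1] += $2 } END { for (i in seen) print i, seen[i] }' file1 | sort -n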
Upvotes: 16
Reputation: 23667
If datamash is okay
$ datamash -t' ' -g 1 sum 2 < ip.txt
1486113768 9936
1486113769 6160736
1486113770 5122176
1486113772 4096832
1486113773 9229920
1486113774 8568888
-t' ' sets space as the field delimiter
-g 1 groups by the 1st field
sum 2 sums the 2nd field values
datamash -st' ' -g 1 sum 2, where the -s option takes care of sorting
Upvotes: 3