Sum up values in first column for all identical strings in the second column

Question

Let's say I have the following 2 files with entries such as these (number, IP and User-agent):

30000 11.11.11.11 Dalvik/2.1.0 Linux
10000 22.22.22.22 GetintentCrawler getintent.com
 5000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
 3000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
 1000 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo

and

 6000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
 3000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
 2000 11.11.11.11 Dalvik/2.1.0 Linux
  600 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo
  500 22.22.22.22 GetintentCrawler getintent.com

I want to be able to sum up the first column for all identical IPs (the second column), while also keeping all the subsequent columns with the user-agent. Also, the final output should be sorted by first column.

So the result should basically look like this:

32000 11.11.11.11 Dalvik/2.1.0 Linux
10500 22.22.22.22 GetintentCrawler getintent.com
 9000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
 8000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
 1600 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo

So far I came up with this, but I lose the whole user-agent string and I also feel that I'm overcomplicating things:

cat file1.txt file2.txt file3.txt | awk '{arr[$2]+=$1;} END {for (i in arr) print i, arr[i]}' | awk '{ print $2" "$1 }' | sort -rn

anubhava · Accepted Answer

You can use this gnu-awk:

awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} {
   p=$1; $1=""; a[$0]+=p} END{for (i in a) print a[i] i}' file1 file2

BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} is used to maintain order of keys in associative array.

Output:

32000 11.11.11.11 Dalvik/2.1.0 Linux
10500 22.22.22.22 GetintentCrawler getintent.com
8000 33.33.33.33 Mozilla/5.0 X11; Linux i686 AppleWebKit/537.36 KHTML, like Gecko Chrome/43.0.2357.130 Safari/537.36
9000 44.44.44.44 Mozilla/5.0 Macintosh; Intel Mac OS X 10_6_8 AppleWebKit/534.59.10 KHTML, like Gecko Version/5.1.9 Safari/534.59.10
1600 55.55.55.55 Dalvik/1.6.0 Linux; U; Android 4.1.2; Orange Yumo Build/OrangeYumo

Sum up values in first column for all identical strings in the second column

Answers (1)

Related Questions