Reputation: 561
I have a follow-up question to my previous question, which was successfully answered here by @fedorgui.
I have a table:
pac1 xxx
pac1 yyy
pac1 zzz
pac2 xxx
pac2 uuu
pac3 zzz
pac3 uuu
pac4 zzz
And I need to calculate output like this:
pac1 xxx 2/4
pac1 yyy 1/4
pac1 zzz 3/4
pac2 xxx 2/4
pac2 uuu 2/4
pac3 zzz 2/4
pac3 uuu 2/4
pac4 zzz 3/4
Where the first number is the count of unique occurrences of the column-two value, divided by the number of unique values in column one (in this case xxx occurs 2 times in column two and column one has 4 unique values => 2/4).
A working awk solution is here:
$ awk 'FNR==NR {col1[$1]++; col2[$2]++; next} {print $0, col2[$2] "/" length(col1)}' file file
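For readability, the same two-pass logic can also be written out as a script (a sketch only; the file name ratio.awk is just an example, and length() on an array is a gawk extension, as in the one-liner above):

$ cat ratio.awk
# First pass (FNR==NR only while reading the first copy of the file):
# collect counts. Second pass: print each line with the ratio.
FNR == NR {
    col1[$1]++          # keys of col1 = distinct column-one values
    col2[$2]++          # count occurrences of each column-two value
    next                # skip the print block during the first pass
}
{
    # length(col1) = number of distinct column-one values (gawk)
    print $0, col2[$2] "/" length(col1)
}
$ awk -f ratio.awk file file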
But my input could have duplicated rows like:
pac1 xxx
pac1 xxx
pac1 xxx
pac1 yyy
pac1 zzz
pac2 xxx
pac2 xxx
pac2 xxx
pac2 uuu
pac3 zzz
pac3 uuu
pac4 zzz
pac4 zzz
And I need to do the same computation, but counting only unique rows (duplicate rows must not be counted), while still printing the statistic on every row, like:
pac1 xxx 2/4
pac1 xxx 2/4
pac1 xxx 2/4
pac1 yyy 1/4
pac1 zzz 3/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 uuu 2/4
pac3 zzz 2/4
pac3 uuu 2/4
pac4 zzz 3/4
pac4 zzz 3/4
This is more complicated, and I have thousands of rows. Thank you for any ideas.
Upvotes: 1
Views: 258
Reputation: 11216
Just check if the line is unique when adding to the second array.
awk 'FNR==NR{a[$1];b[$2]+=!c[$1,$2]++;next}{print $0, b[$2] "/" length(a)}' test{,}
pac1 xxx 2/4
pac1 xxx 2/4
pac1 xxx 2/4
pac1 yyy 1/4
pac1 zzz 3/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 uuu 2/4
pac3 zzz 3/4
pac3 uuu 2/4
pac4 zzz 3/4
pac4 zzz 3/4
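To make the increment trick easier to follow, here is the same logic written out as a commented script (a sketch only; the file name uniqratio.awk is just an example, run as awk -f uniqratio.awk test test):

FNR == NR {
    a[$1]                     # remember each column-one value (key only)
    # c[$1,$2]++ evaluates to 0 the first time this (col1,col2) pair is
    # seen, so !c[$1,$2]++ is 1 only for the first occurrence of the pair;
    # b[$2] therefore counts unique pairs per column-two value.
    b[$2] += !c[$1, $2]++
    next                      # first pass: only collect counts
}
{
    # second pass: append "unique pairs for $2 / unique column-one values"
    print $0, b[$2] "/" length(a)
}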
Or, if there aren't random spaces at the end of your lines (as in your example), you could just use $0 instead of $1,$2.
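That variant would look something like this (a sketch; it keys the duplicate check on the whole line, so trailing whitespace would make otherwise identical lines count separately):

awk 'FNR==NR{a[$1];b[$2]+=!c[$0]++;next}{print $0, b[$2] "/" length(a)}' test{,}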
Upvotes: 5