user2605844

Reputation: 11

Count and print the number of occurrences

I have some files like the one shown below:

GLL  ALM  654-656  654  656 
SEM  LYG  655-657  655  657
SEM  LYG  655-657  655  657 
ALM  LEG  656-658  656  658 
ALM  LEG  656-658  656  658  
ALM  LEG  656-658  656  658  
LEG  LEG  658-660  658  660 
LEG  LEG  658-660  658  660

The value of GLL is 654 and the value of ALM is 656. In the same way, the fourth column holds the values for the first column, and the fifth column holds the values for the second column. I would like to count the unique occurrences of each number in the fourth and fifth columns, so that repeated lines are counted only once.

Desired output

654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1

Upvotes: 0

Views: 249

Answers (3)

user539810

Reputation:

Sorry it is so long, but it works, and as a bonus it also handles lines with more than two pairs. See Edit 2 for more info. :-)

awk '
BEGIN { SUBSEP = FS;
    before = 0;
    between = 1;
    after = 0;
}

{
    offset = int((NF - after - before - between) / 2) + between;
    for (i=1 + before; i <= offset + before - between; i++) {
        j = i + offset;
        if (! ((i, $j, $i) in entry))
            entry[i, $j, $i]++;
    }
}

END {
    # Tally into a separate array: changing entry while looping over it is not portable.
    for (item in entry) {
        split(item, itema);
        count[itema[2], itema[3]]++;
    }
    for (item in count)
        print item, count[item];
}' filename | sort -n

The first part filters the input, keeping only unique occurrences of each pair that should appear in the first and second columns of the output, tracked per input column. The second part combines the results, adding 1 for each column in which a pair occurs (e.g. LEG,658 appears at least once in both $1,$4 and $2,$5, so it is counted twice), and prints the results, which are passed to the sort utility to be sorted numerically.

It is generalized for N pairs, so if you have something like the following in the future, the script still works, so long as only pairs are added (you can't add another separate field, or the script breaks):

GLL  ALM  LEG  654-660  654  656  660

I suppose if you wanted, you could add extra fields at the beginning and change the start value of i, or add them at the end and subtract one more from the end value of i for each new field (e.g. NF - 2 if you add one more unpaired field at the end). Accommodating unpaired values in the middle would require a redesign, because the data set would be completely different.

Edit: It's only so long because it is (somewhat) flexible and because I'm still an awk newbie. I'd recommend Kent's if you don't like mine (or if it doesn't work; I'm not using a computer that has awk installed at the moment).

Edit 2: Updated script. The earlier version didn't work; this one does, and it can handle arbitrary offsets so long as no unpaired fields split the pairs up. Something like the following works:

GLL ALM LYG 654-657 654 656 657
SEM LYG 655-657 655 657
SEM LYG LEG 655-660 655 657 660
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LYG LEG 657-660 657 660

Output:

654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 3
658 LEG 2
660 LEG 2

Edit 3: The script now handles arbitrary contiguous unpaired fields. You must configure three values: before, the number of fields before the first part of a pair (i.e. before the first GLL, ALM, etc. on the line); between, the number of fields between the first and second parts of the pairs; and after, the number of fields after the list of second parts. The layout must be contiguous and consistent. You can't have 1 field before the first pair start component on one line and 5 on another, and you can't have one pair start/end component separated from another of the same kind (e.g. "GLL xyz ALM 654 656" doesn't work because "xyz" separates "GLL" and "ALM", which are both pair start components).

For anything more than this, actual knowledge about the data set would be required, such as if GLL may have extra information immediately after it, but ALM does not ever have such data.
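As a rough sketch of how those three settings could be passed in from the shell instead of being edited in the BEGIN block: the function name count_pairs and its argument order (before, between, after, then optional files) are made up for illustration, and the tally goes into a separate count array. It assumes every line has the same layout.

```shell
# Hypothetical wrapper: count_pairs BEFORE BETWEEN AFTER [FILE...]
count_pairs() {
    before=$1 between=$2 after=$3
    shift 3
    awk -v before="$before" -v between="$between" -v after="$after" '
    {
        # Number of pairs on this line, and the distance from a name to its value.
        npairs = int((NF - before - between - after) / 2)
        offset = npairs + between
        for (i = before + 1; i <= before + npairs; i++)
            seen[i, $(i + offset), $i] = 1    # one entry per (column, value, name)
    }
    END {
        # Each distinct column in which a (value, name) pair was seen adds 1.
        for (key in seen) {
            split(key, part, SUBSEP)
            count[part[2] " " part[3]]++
        }
        for (pair in count)
            print pair, count[pair]
    }' "$@" | sort -n
}
```

With the layout from the question this would be called as `count_pairs 0 1 0 filename`; a file with one extra leading field on every line would use `count_pairs 1 1 0 filename`.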

Upvotes: 0

glenn jackman

Reputation: 246764

My take:

sort -u file |
awk '
    BEGIN {SUBSEP = OFS}
    {count[$4,$1]++; count[$5,$2]++}
    END {for (key in count) print key, count[key]}
' |
sort -n

Output:

654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1

Upvotes: 0

Kent

Reputation: 195039

If I understand your question right, this script could give you the output:

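awk '{d[$4]=$1; d[$5]=$2; p[$4]; l[$5]}   # d: number -> name; p, l: numbers seen in columns 4 and 5
END{
    for(k in p){
        if (k in l){          # number occurs in both columns
            delete l[k]
            print k,d[k],2
        }else
            print k,d[k],1
    }
    for (k in l)              # numbers that occur only in column 5
        print k,d[k],1
} ' file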

With your input data, the script above prints:

654 GLL 1
655 SEM 1
656 ALM 2
658 LEG 2
657 LYG 1
660 LEG 1

So the order is not 100% the same as your expected output, but if you pipe it to sort -n, you get exactly the same thing. The sorting could be done within awk too; I was a bit lazy... :)
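For what it's worth, here is one way the ordering could be handled inside awk. This is a sketch, not Kent's code. Rather than a true sort, it remembers the smallest and largest number seen and walks that range in END, which only makes sense if the values are plain non-negative integers in a reasonably small range (an assumption, not something the question states). The function name count_sorted is made up for illustration. With gawk, setting PROCINFO["sorted_in"] would be another option.

```shell
# Hypothetical helper: count_sorted [FILE...]
count_sorted() {
    awk '
    NF >= 5 {
        name[$4] = $1; name[$5] = $2    # number -> name
        in4[$4] = 1;   in5[$5] = 1      # numbers seen in columns 4 and 5
        for (c = 4; c <= 5; c++) {
            if (min == "" || $c + 0 < min) min = $c + 0
            if (max == "" || $c + 0 > max) max = $c + 0
        }
    }
    END {
        # Walk the numeric range in order instead of sorting the keys.
        for (k = min; k <= max; k++) {
            n = 0
            if (k in in4) n++
            if (k in in5) n++
            if (n) print k, name[k], n
        }
    }' "$@"
}
```

Called as `count_sorted file`, this needs no trailing sort -n, at the cost of stepping through every integer between the minimum and maximum.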

Upvotes: 1
