Reputation: 11
I have some files like the one shown below:
GLL ALM 654-656 654 656
SEM LYG 655-657 655 657
SEM LYG 655-657 655 657
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LEG LEG 658-660 658 660
The value of GLL is 654 and the value of ALM is 656. In the same way, the 4th column holds the values for the 1st column, and the 5th column holds the values for the 2nd column. I would like to count the unique occurrences of each number in the fourth and fifth columns.
Desired output
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
Upvotes: 0
Views: 249
Reputation:
Sorry it is so long, but it works, and as a bonus it also handles lines that carry more than two pairs. See Edit 2 for more info. :-)
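awk '
BEGIN {
    SUBSEP = FS;    # join subscripts with a space so keys print and split cleanly
    before = 0;     # fields before the first pair name
    between = 1;    # fields between the names and their values
    after = 0;      # fields after the last value
}
{
    # offset = distance from a name field to its value field
    offset = int((NF - after - before - between) / 2) + between;
    for (i = 1 + before; i <= offset + before - between; i++) {
        j = i + offset;
        # remember each (column, value, name) triple once
        entry[i, $j, $i];
    }
}
END {
    # count in a separate array: adding keys to an array while
    # looping over it is not portable across awk implementations
    for (item in entry) {
        split(item, itema);
        count[itema[2], itema[3]]++;
    }
    for (item in count)
        print item, count[item];
}' filename | sort -n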
The first part filters the input, accepting only unique occurrences of each pair that should appear in the first and second columns of the output. The second part combines the results, adding 1 for each column in which a pair occurs (e.g. LEG,658 appears at least once in both $1,$4 and $2,$5, so it is counted twice), and prints the results, which are passed to the sort utility to sort the output numerically.
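For the fixed five-column layout in the question, the same two steps (keep each pair once per column, then count) can be sketched in a much shorter form. This is only an illustration under that assumption: the array names `seen` and `count` and the temporary input file are mine, not part of the script above.

```shell
# Sample data from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
GLL ALM 654-656 654 656
SEM LYG 655-657 655 657
SEM LYG 655-657 655 657
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LEG LEG 658-660 658 660
EOF

result=$(awk '
{
    # Count each (column, number, name) triple only the first time it is seen.
    if (!((1, $4, $1) in seen)) { seen[1, $4, $1]; count[$4 " " $1]++ }
    if (!((2, $5, $2) in seen)) { seen[2, $5, $2]; count[$5 " " $2]++ }
}
END {
    # Print "number name total"; ordering is left to sort.
    for (key in count) print key, count[key]
}' "$input" | sort -n)

printf '%s\n' "$result"
rm -f "$input"
```

With the sample data this prints the six lines of the desired output, from 654 GLL 1 through 660 LEG 1.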
It is generalized for N pairs, so if you have something like the following in the future, the script still works, so long as only pairs are added (you can't add another separate field, or the script breaks):
GLL ALM LEG 654-660 654 656 660
I suppose if you wanted, you could add extra fields to the beginning and change the start value of i. Or you could add fields at the end and subtract one more from the end value of i for each new field (e.g. NF - 2 if you add one more unpaired field at the end). It would require a redesign to accommodate unpaired values in the middle because the data set would be completely different.
Edit It's only so long because it is flexible (somewhat) and because I'm still an awk newbie. I'd recommend Kent's if you don't like mine (or it doesn't work--I'm not using a computer that has awk installed at the moment).
Edit 2 Updated script. It didn't work before, and it can now handle arbitrary offsets so long as no unpaired fields split the pairs up. Something like the following works:
GLL ALM LYG 654-657 654 656 657
SEM LYG 655-657 655 657
SEM LYG LEG 655-660 655 657 660
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LYG LEG 657-660 657 660
Output:
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 3
658 LEG 2
660 LEG 2
Edit 3 The script now handles arbitrary contiguous unpaired fields. You must configure how many fields you have before the first part of a pair begins (e.g. how many fields before the first GLL, ALM, etc. on the line), how many fields are between the first and second parts of the pairs, and how many fields are after the list of second parts of the pairs. Note that it must be contiguous and consistent, meaning you can't have something like 1 field before the first pair start component for one line and 5 fields before the first pair start component on another line, and you can't have a pair start/end component separated from another of the same (e.g. "GLL xyz ALM 654 656" doesn't work because "xyz" separates "GLL" and "ALM", which are both pair start components).
For anything more than this, actual knowledge about the data set would be required, such as if GLL may have extra information immediately after it, but ALM does not ever have such data.
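As a rough illustration of that configuration, here is a sketch for a made-up layout with one extra field before the names and one after the values, so before, between and after are all 1. The sample lines (the r1/r2/r3 identifiers and the trailing x/y/z fields) are hypothetical, and the variable `pairs` and the arrays `seen` and `count` are names I introduced; the settings are passed with -v instead of being set in BEGIN.

```shell
# Hypothetical input: an id field first, a trailing note field last.
input=$(mktemp)
cat > "$input" <<'EOF'
r1 GLL ALM 654-656 654 656 x
r2 ALM LEG 656-658 656 658 y
r3 ALM LEG 656-658 656 658 z
EOF

result=$(awk -v before=1 -v between=1 -v after=1 '
{
    # Number of name/value pairs on this line.
    pairs = int((NF - before - between - after) / 2)
    for (i = before + 1; i <= before + pairs; i++) {
        j = i + pairs + between      # field holding the value for name $i
        seen[i, $j, $i]              # remember each triple once
    }
}
END {
    # Add 1 for every column position in which a (value, name) pair occurred.
    for (item in seen) {
        split(item, part, SUBSEP)
        count[part[2] " " part[3]]++
    }
    for (key in count) print key, count[key]
}' "$input" | sort -n)

printf '%s\n' "$result"
rm -f "$input"
```

For these three lines the result is 654 GLL 1, 656 ALM 2 and 658 LEG 1: the ids and notes are ignored, and the repeated ALM LEG line is counted only once per column.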
Upvotes: 0
Reputation: 246764
My take:
sort -u file |
awk '
BEGIN {SUBSEP = OFS}
{count[$4,$1]++; count[$5,$2]++}
END {for (key in count) print key, count[key]}
' |
sort -n
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
Upvotes: 0
Reputation: 195039
If I understand your question right, this script could give you the output:
awk '{d[$4]=$1; d[$5]=$2; p[$4]; l[$5]}
END {
    for (k in p) {
        if (k in l) {
            delete l[k]
            print k, d[k], "2"
        } else
            print k, d[k], "1"
    }
    for (k in l)
        print k, d[k], 1
}' file
With your input data, the output of the above script is:
654 GLL 1
655 SEM 1
656 ALM 2
658 LEG 2
657 LYG 1
660 LEG 1
So it is not 100% the same as your expected output (the order differs), but if you pipe it to sort -n, it is gonna give you exactly the same thing. The sorting part could be done within the awk too. I was a bit lazy... :)
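For what it is worth, sorting inside awk could look roughly like the sketch below. It keeps the idea of the script above (a number counts once for each of columns 4 and 5 it appears in) and adds a small insertion sort over the collected keys, so it should not depend on any particular awk implementation. The array names and the temporary file are assumptions for the example; in GNU awk, setting PROCINFO["sorted_in"] would be a shorter alternative.

```shell
# Sample data from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
GLL ALM 654-656 654 656
SEM LYG 655-657 655 657
SEM LYG 655-657 655 657
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LEG LEG 658-660 658 660
EOF

result=$(awk '
{ name[$4] = $1; name[$5] = $2; first[$4]; second[$5] }
END {
    # One point for each column in which the number occurs.
    for (k in first)  count[k]++
    for (k in second) count[k]++

    # Collect the keys, then insertion-sort them numerically.
    n = 0
    for (k in count) keys[++n] = k
    for (i = 2; i <= n; i++) {
        v = keys[i]
        for (j = i - 1; j >= 1 && keys[j] + 0 > v + 0; j--)
            keys[j + 1] = keys[j]
        keys[j + 1] = v
    }
    for (i = 1; i <= n; i++)
        print keys[i], name[keys[i]], count[keys[i]]
}' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

This prints the lines in the order of the desired output without the extra sort -n at the end of the pipeline.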
Upvotes: 1