Reputation: 115
I have a text file like this small example:
>ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|43
FFYFIIWSLTLLPRAGLELLTSSDPPASASQSVGITGVSHHAQ
>ENST00000594233.1|ENSG00000269308.1|-|-|AL645608.2-201|AL669831.1|18
DFMHLFFIPSSELILPYP
>ENST00000420190.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000316521.1|SAMD11-011|SAMD11|179
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSR
>ENST00000437963.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000097862.3|SAMD11-003|SAMD11|109
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPT
The file consists of many records, and each record has 2 lines. The first line is an ID line starting with ">" and the 2nd line is a sequence of characters.
In the ID line the fields are separated by "|", and the value in the 6th column is repeated across many records. I want to make a new file that keeps only one record for each distinct value of the 6th column: the record with the largest number in column 7.
The expected output for the small example would be:
>ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|43
FFYFIIWSLTLLPRAGLELLTSSDPPASASQSVGITGVSHHAQ
>ENST00000420190.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000316521.1|SAMD11-011|SAMD11|179
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSR
To make this file I wrote this code:
awk -F"|" ' /^>/{(array1[val]=array[val]>length($0)) print array1} Input.txt > out.txt
But it does not return anything. Do you know how to fix it to get the expected output?
Upvotes: 0
Views: 84
Reputation: 67497
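If the repeated IDs are contiguous, buffer the best record of the current ID and print it when the ID changes:
$ awk -F'|' -v RS='>' -v ORS='' '
NR>1 && p!=$6    {print r; p=$6; max=$7+0; r=">" $0; next}   # new ID: flush the previous best record
NR>1 && max<$7+0 {max=$7+0; r=">" $0}                        # same ID with a larger 7th field
END              {print r}' file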
>ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|43
FFYFIIWSLTLLPRAGLELLTSSDPPASASQSVGITGVSHHAQ
>ENST00000420190.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000316521.1|SAMD11-011|SAMD11|179
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSR
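Whether the contiguity assumption holds can be checked before relying on the command above. This is only a minimal sketch, assuming the header layout from the question (the ID is the 6th "|"-separated field); the file name sample.fa, its made-up records, and the variable name noncontig are illustrative and not part of the original post:

```shell
# Build a small test file (hypothetical name and data) in which
# ID "A" appears in two separate runs, with "B" in between.
cat > sample.fa <<'EOF'
>t1|g1|-|-|A-201|A|5
MKKLL
>t2|g2|-|-|B-201|B|3
MKK
>t3|g1|-|-|A-202|A|7
MKKLLPP
EOF

# Print every ID whose header lines are not all adjacent to each other.
noncontig=$(awk -F'|' '
  /^>/ {
    # The ID differs from the previous header but was seen earlier:
    # it starts a second run, so the IDs are not contiguous.
    if ($6 != prev && ($6 in seen) && !($6 in reported)) {
      print $6
      reported[$6] = 1
    }
    seen[$6] = 1
    prev = $6
  }' sample.fa)

echo "non-contiguous IDs: ${noncontig:-none}"
```

For this test file the check reports A. An empty result would suggest that every ID forms a single run, which is the case the contiguous-IDs command is written for.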
Upvotes: 1
Reputation: 133518
EDIT: Adding a solution, as per the OP's request, that prints the output in the same order in which the IDs first appear in Input_file.
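awk -F"|" '
/^>/{
  id=$6                                          # ID is the 6th field of the header line
  if(!(id in max)){ f[++count]=id; max[id]=-1 }  # remember the order in which IDs first appear
  hdr=$0; n=$7+0                                 # header line and the number in its 7th field
  getline                                        # $0 now holds the sequence line
  if(n>max[id]){ max[id]=n; c[id]=hdr; b[id]=$0 } # keep header and sequence of the largest entry together
}
END{
  for(i=1;i<=count;i++)
    print c[f[i]] ORS b[f[i]]
}' Input_file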
If you do not care about the order of the output (that is, it need not match the order in Input_file), then the following may help.
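awk -F"|" '
/^>/{
  id=$6                     # ID is the 6th field of the header line
  hdr=$0; n=$7+0            # header line and the number in its 7th field
  getline                   # $0 now holds the sequence line
  if(!(id in max) || n>max[id]){ max[id]=n; c[id]=hdr; b[id]=$0 }
}
END{
  for(i in c)               # traversal order of "for (i in c)" is unspecified
    print c[i] ORS b[i]
}' Input_file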
The version that keeps the input order in the output is the one given in the EDIT above.
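A quick sanity check on the result is to confirm that no ID occurs on more than one header line of the output. The following is a rough sketch under the same assumption about the header layout (ID in the 6th "|"-separated field); the file name out_sample.fa, its made-up records, and the variable name dup_ids are hypothetical:

```shell
# Hypothetical output file in which ID "A" was accidentally kept twice.
cat > out_sample.fa <<'EOF'
>t3|g1|-|-|A-202|A|7
MKKLLPP
>t2|g2|-|-|B-201|B|3
MKK
>t1|g1|-|-|A-201|A|5
MKKLL
EOF

# List each ID that occurs on more than one header line.
# The "== 2" test makes sure every such ID is printed only once.
dup_ids=$(awk -F'|' '/^>/ && ++count[$6] == 2 { print $6 }' out_sample.fa)

echo "duplicated IDs: ${dup_ids:-none}"
```

Here the check reports A. On a correctly filtered file the list should be empty.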
Upvotes: 1