user7249622

Reputation: 115

How to summarize a text file into a new one using awk

I have a text file like this small example:

>ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|43
FFYFIIWSLTLLPRAGLELLTSSDPPASASQSVGITGVSHHAQ
>ENST00000594233.1|ENSG00000269308.1|-|-|AL645608.2-201|AL669831.1|18
DFMHLFFIPSSELILPYP
>ENST00000420190.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000316521.1|SAMD11-011|SAMD11|179
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSR
>ENST00000437963.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000097862.3|SAMD11-003|SAMD11|109
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPT

This file has many records, and each record has 2 lines. The first line is an ID line starting with ">" and the 2nd line is a sequence of characters. In the ID line the fields are "|"-separated, and the value in the 6th column is repeated across many records. I want to make a new file that keeps only one record per 6th-column value: the one with the largest number in column 7. The expected output for the small example would be:

>ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|43
FFYFIIWSLTLLPRAGLELLTSSDPPASASQSVGITGVSHHAQ
>ENST00000420190.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000316521.1|SAMD11-011|SAMD11|179
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSR

To make this file I wrote this code:

awk -F"|" ' /^>/{(array1[val]=array[val]>length($0)) print array1}  Input.txt > out.txt

But it does not return anything. Do you know how to fix it to get the expected output?

Upvotes: 0

Views: 84

Answers (2)

karakfa

Reputation: 67497

If the repeated IDs are contiguous (all records sharing a 6th-column value sit next to each other):

$ awk -F'|' -v RS='>' -v ORS='' 'NR>1 && p!=$6    {print r; p=$6; max=$7+0; r=">" $0; next}  # new ID: flush the previous best
                                 NR>1 && max<$7+0 {max=$7+0; r=">" $0}                        # same ID, larger column 7
                                 END              {print r}' file

>ENST00000599533.1|ENSG00000269831.1|-|-|AL669831.1-201|AL669831.1|43
FFYFIIWSLTLLPRAGLELLTSSDPPASASQSVGITGVSHHAQ
>ENST00000420190.1|ENSG00000187634.6|OTTHUMG00000040719.8|OTTHUMT00000316521.1|SAMD11-011|SAMD11|179
MSKGILQVHPPICDCPGCRISSPVNRGRLADKRTVALPAARNLKKERTPSFSASDGDSDGSGPTCGRRPGLKQEDGPHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQSR
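
A side note on this record-splitting approach: with RS set to ">" and "|" as the field separator, the 7th field holds the number plus the sequence line after it, so a bare comparison between two such fields is a string comparison, not a numeric one. The sketch below uses a made-up two-record sample (the IDs and sequences are invented for illustration) to show the difference and how adding 0 forces the numeric reading:

```shell
# Hypothetical two-record sample (IDs and sequences invented for illustration).
sample='>t1|g1|-|-|n1|GENE|9
AAAAAAAAA
>t2|g1|-|-|n2|GENE|10
AAAAAAAAAA
'
# With RS=">" and FS="|", field 7 is "<number>\n<sequence>\n", so a bare
# comparison is a string comparison; adding 0 keeps only the leading number.
result=$(printf '%s' "$sample" | awk -F'|' -v RS='>' '
  NR == 2 { a = $7 }
  NR == 3 { b = $7
            print ((a < b)     ? "string: 9<10"  : "string: 9>10")
            print ((a+0 < b+0) ? "numeric: 9<10" : "numeric: 9>10") }')
printf '%s\n' "$result"
```

On the sample this prints "string: 9>10" followed by "numeric: 9<10", which is why comparing on $7+0 is the safer choice when column 7 can have a different number of digits from one record to the next.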

Upvotes: 1

RavinderSingh13

Reputation: 133518

EDIT: Adding a solution, as the OP asked, that prints the output in the same order in which the IDs first appear in Input_file.

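awk -F"|" '
/^>/{
  id=$6; n=$7+0; hdr=$0                      # ID, numeric column 7, full header line
  if((getline seq)<=0) seq=""                # read the sequence line that follows
  if(!(id in max)){ order[++count]=id }      # remember first-seen order of IDs
  if(!(id in max) || n>max[id]){             # first record, or larger column 7
    max[id]=n; rec[id]=hdr ORS seq
  }
}
END{
  for(i=1;i<=count;i++){
    print rec[order[i]]
  }
}'  Input_file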

If you are not worried about the order of the output (i.e. it need not match the order in Input_file), then the following may help.

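awk -F"|" '
/^>/{
  id=$6; n=$7+0; hdr=$0                      # ID, numeric column 7, full header line
  if((getline seq)<=0) seq=""                # read the sequence line that follows
  if(!(id in max) || n>max[id]){             # first record, or larger column 7
    max[id]=n; rec[id]=hdr ORS seq
  }
}
END{
  for(id in rec){                            # traversal order is unspecified
    print rec[id]
  }
}'  Input_file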

The order-preserving version is the one given under EDIT above.
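
To sanity-check a result file, one option is to count how often each 6th-column value still appears; anything reported more than once means the de-duplication did not work. This is only a sketch: the three records below are invented for illustration, and in practice you would run the same awk program on your real output file (for example out.txt) instead of the printf input.

```shell
# Hypothetical output-file contents (invented for illustration); in practice
# run the awk program on the real result file, e.g. out.txt.
check=$(printf '%s\n' \
  '>t1|g1|-|-|n1|GENE_A|12' 'AAAAAAAAAAAA' \
  '>t2|g2|-|-|n2|GENE_B|7'  'CCCCCCC' \
  '>t3|g1|-|-|n3|GENE_A|5'  'GGGGG' |
  awk -F'|' '/^>/ { seen[$6]++ }                       # count header lines per ID
             END  { for (id in seen) if (seen[id] > 1) print id, seen[id] }')
printf '%s\n' "$check"
```

Here GENE_A is reported with a count of 2; a correct result file should produce no output at all.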

Upvotes: 1
