Reputation:
I have a file like this small example:
>ENSG00000004142|ENST00000003607|POLDIP2|||2118
Sequence unavailable
>ENSG00000003056|ENST00000000412|M6PR|9099001;9102084|9099001;9102551|2756
CCAGGTTGTTTGCCTCTGGTCGGAAAGGGAAACTACCCCTGCTTCCACTCTGACAGCAGA
However, it contains too many "Sequence unavailable" entries. I want to get rid of those transcripts, so that the result looks like this:
>ENSG00000003056|ENST00000000412|M6PR|9099001;9102084|9099001;9102551|2756
CCAGGTTGTTTGCCTCTGGTCGGAAAGGGAAACTACCCCTGCTTCCACTCTGACAGCAGA
I tried to filter out those parts in bash using
grep -A 2 "Sequence" your.fa | grep -v "\-\-" | sed -n '/Sequence/!p' > new.fa
but that only removes the "Sequence unavailable" line, not its header (the line starting with ">"
above each sequence, which is the identifier for that sequence).
How can I filter them out in bash or Python?
Upvotes: 0
Views: 40
Reputation: 47099
Assuming both the row containing Sequence unavailable
and the row above it should be removed, one can use this sed command:
$ sed '$!N;/\nSequence unavailable$/d;P;D' input
It works by keeping two lines in the pattern space at a time: it prints the first one and then deletes it, which leaves the current line in the pattern space, so the output always lags one row behind the input:
$!N; # Append Next line to pattern space unless
# there are no more lines
/\nSequence unavailable$/d # Delete whole pattern space if regex is matched
P; # Print first line of pattern space
D # Delete first line of pattern space
The above works in GNU sed. One might need to change ;D
to ;$!D;q
to make it work with a strictly POSIX sed, or one would end up with an endless loop.
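Since the question also asks about Python, here is a minimal sketch of the same "one row behind" idea: hold each line back until the next line shows whether it should be kept. The function name drop_unavailable and its marker parameter are made up for illustration, and the sketch assumes the marker line is exactly "Sequence unavailable" with no surrounding whitespace.

```python
def drop_unavailable(lines, marker="Sequence unavailable"):
    """Yield lines, skipping each marker line and the line directly above it.

    Hypothetical helper (not from any library); assumes the marker
    occupies a whole line on its own.
    """
    held = None  # previous line, held back until we have seen what follows it
    for line in lines:
        line = line.rstrip("\n")
        if line == marker:
            held = None  # drop the held header together with the marker line
            continue
        if held is not None:
            yield held  # the held line was not followed by the marker
        held = line
    if held is not None:
        yield held  # flush the last line of the input


sample = [
    ">ENSG00000004142|ENST00000003607|POLDIP2|||2118",
    "Sequence unavailable",
    ">ENSG00000003056|ENST00000000412|M6PR|9099001;9102084|9099001;9102551|2756",
    "CCAGGTTGTTTGCCTCTGGTCGGAAAGGGAAACTACCCCTGCTTCCACTCTGACAGCAGA",
]

for kept in drop_unavailable(sample):
    print(kept)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the sample list, and the yielded lines written to a new file with a trailing newline added to each.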
Upvotes: 2