Reputation: 3263
I would like to retrieve the nth sequence (or preferably the nth to mth sequences) from an input FASTA file, ideally with a Unix "one-liner".
I know I could read the sequence with perl (or any other scripting language), count, and then print the sequence, but I'm looking for something faster and more compact.
For those unaware, a sample fasta file looks like the following:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Upvotes: 3
Views: 395
Reputation: 3756
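A sed one-liner (no pipe | needed):
sed '/^>SEQUENCE_'"$n"'$/,/^>SEQUENCE_'"$((m + 1))"'$/!d;/^>SEQUENCE_'"$((m + 1))"'$/d' file
This assumes the headers are literally named SEQUENCE_1, SEQUENCE_2, and so on; the ^ and $ anchors keep SEQUENCE_1 from also matching SEQUENCE_10.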
Upvotes: 2
Reputation: 77185
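One way with awk:
awk -v RS='>' -v start="$n" -v end="$m" 'NR>=(start+1)&&NR<=(end+1){printf ">%s", $0}' fasta_file
Each record already ends in a newline, so printf is used here; print would add a blank line after every sequence.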
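The start+1 and end+1 offsets come from how awk counts records when RS is '>': the file begins with a '>', so the first record is the empty string before it. A small sketch on a toy two-sequence input (the function name show_records is made up for illustration) makes the numbering visible:

```shell
# Print each record number together with the first field of the record
# (the sequence name), using ">" as the record separator.
show_records() {
    printf '>a\nAA\n>b\nBB\n' |
        awk -v RS='>' '{ printf "%d:[%s]\n", NR, $1 }'
}
```

Calling show_records lists record 1 as empty, with the real sequences starting at record 2.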
Upvotes: 2
Reputation: 98118
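With sed:
sed -n '/SEQUENCE_'"$n"'$/,/SEQUENCE_'"$((m + 1))"'$/p' input | sed '$d'
The second sed removes the header of sequence m+1. If sequence m is the last one in the file, it removes the final sequence line instead.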
Upvotes: 2
Reputation: 54592
Here are two ways using awk.
If each sequence sits on a single line (one header line followed by one sequence line), this would work:
awk -v n=5 -v m=8 'NR == n * 2 - 1, NR == m * 2' file.fa
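The same line arithmetic can be handed to sed, which can also quit as soon as the last wanted line has been printed. This is only a sketch under the same two-lines-per-record assumption; fasta_lines is a hypothetical helper name, not something from the answer above.

```shell
# fasta_lines N M FILE
# Assumes every record is exactly two lines: a header and one sequence line.
fasta_lines() {
    first=$(( 2 * $1 - 1 ))   # header line of record N
    last=$(( 2 * $2 ))        # sequence line of record M
    # Print lines first..last, then quit instead of reading the rest.
    sed -n "${first},${last}p;${last}q" "$3"
}
```

For example, fasta_lines 5 8 file.fa prints lines 9 through 16.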
If your sequences are wrapped across several lines, then this may be more appropriate:
awk -v n=5 -v m=8 '/^>/ { c++ } c == n { f=1 } c == m + 1 { f=0 } f' file.fa
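Since the question asks for speed, a possible variation on the header-counting approach is to stop reading once record m has been printed. The sketch below wraps it in a shell function; the name fasta_range and its argument order are assumptions made for illustration.

```shell
# fasta_range N M FILE
# Print records N..M (1-based, inclusive); works for wrapped sequences
# and does not depend on how the headers are named.
fasta_range() {
    awk -v n="$1" -v m="$2" '
        /^>/ { c++ }      # every header line starts a new record
        c > m { exit }    # past the last wanted record: stop reading
        c >= n            # inside the range: print the line
    ' "$3"
}
```

For example, fasta_range 5 8 file.fa prints the fifth through eighth records and exits at the ninth header.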
Upvotes: 2