Reputation: 3263

Get a specific sequence from a FASTA file with regex

I would like to retrieve the n^th sequence (or preferably the n^th to m^th sequences) from an input FASTA file, ideally with a Unix "one-liner".

I know I could read the sequence with perl (or any other scripting language), count, and then print the sequence, but I'm looking for something faster and more compact.

For those unaware, a sample fasta file looks like the following:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Upvotes: 3

Answers (4)

captcha

Reputation: 3756

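A sed one-liner (no pipe needed), assuming the headers are literally named >SEQUENCE_1, >SEQUENCE_2, and so on. The patterns are anchored so that, for example, SEQUENCE_1 does not also match SEQUENCE_10:

sed '/^>SEQUENCE_'"$n"'$/,/^>SEQUENCE_'"$((m + 1))"'$/!d; /^>SEQUENCE_'"$((m + 1))"'$/d' file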

Upvotes: 2

jaypal singh

Reputation: 77185

One way with awk:

awk -v RS='>' -v ORS='' -v start="$n" -v end="$m" 'NR>=start+1 && NR<=end+1 {print ">" $0}' fasta_file
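The +1 offset is there because the file starts with '>', so the first '>'-separated record is empty and sequence k ends up as record k+1. A small sketch to see the numbering, where show_records is a made-up helper name and the two-record input is invented for illustration:

```shell
# show_records: hypothetical helper that prints each '>'-separated record
# number next to the record's first whitespace-delimited field (the header).
show_records() {
    awk -v RS='>' '{ printf "%d:[%s]\n", NR, $1 }' "$@"
}

# Made-up two-record input, read from stdin.
printf '>a\nAA\n>b\nBB\n' | show_records
# 1:[]
# 2:[a]
# 3:[b]
```

Record 1 is the empty text before the first '>', which is why the range test compares NR against start+1 and end+1.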

Upvotes: 2

perreal

Reputation: 98118

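With sed, assuming the headers are literally named >SEQUENCE_1, >SEQUENCE_2, and so on (the patterns are anchored so that SEQUENCE_1 does not also match SEQUENCE_10). Note that if sequence m is the last one in the file, the trailing sed '$d' removes its final line:

sed -n '/^>SEQUENCE_'"$n"'$/,/^>SEQUENCE_'"$((m + 1))"'$/p' input | sed '$d'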

Upvotes: 2

Steve

Reputation: 54592

Here are two ways using awk.

If each record is exactly two lines (one header line followed by the whole sequence on a single line), this would work:

awk -v n=5 -v m=8 'NR == n * 2 - 1, NR == m * 2' file.fa

If your sequences are wrapped across several lines, then this may be more appropriate, since it counts header lines instead of line numbers:

awk -v n=5 -v m=8 '/^>/ { c++ } c == n { f=1 } c == m + 1 { f=0 } f' file.fa
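The header-counting idea can also be wrapped in a small shell function that stops reading once it is past record m, which may help on large files. This is only a sketch: fasta_range is a made-up name, and the three-record test file is invented for illustration.

```shell
# fasta_range N M FILE: hypothetical helper that prints records N through M
# (1-based, inclusive) by counting header lines; header names are not used.
fasta_range() {
    awk -v n="$1" -v m="$2" '
        /^>/ { c++ }      # every header line starts a new record
        c > m { exit }    # past the last wanted record: stop reading
        c >= n            # inside the range: print the line
    ' "$3"
}

# Example run on a small, made-up three-record file with wrapped sequences.
tmp=$(mktemp)
printf '>s1\nAAAA\nCC\n>s2\nGGGG\n>s3\nTTTT\nTT\n' > "$tmp"
fasta_range 2 3 "$tmp"
rm -f "$tmp"
```

With n equal to m it prints a single record, and because it never looks at the header text it should work whatever the sequences are called.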

Upvotes: 2
