Reputation: 141
I have a file like this (with +10000 sequences, +98000 lines):
>DILT_0000000001-mRNA-1
MKVVKICSKLRKFIESRKDAVLPEQEEVLADLWAFEGISEFQMERFAKAAQCFQHQYELA
IKANLTEHASRSLENLGRARARLYDYQGALDAWTKRLDYEIKGIDKAWLHHEIGRAYLEL
NQYEEAIDHAATARDVADREADMEWDLNATVLIAQAHFYAGNLEEAKVYFEAAQNAAFRK
GFFKAESVLAEAIAEVDSEIRREEAKQERVYTKHSVLFNEFSQRAVWSEEYSEELHLFPF
AVVMLRCVLARQCTVHLQFRSCYNL
>DILT_0000000101-mRNA-1
MSCRRLSMNPGEALIKESSAPSRENLLKPYFDEDRCKFRHLTAEQFSDIWSHFDLDGVNE
LRFILRVPASQQAGTGLRFFGYISTEVYVHKTVKVSYIGFRKKNNSRALRRWNVNKKCSN
AVQMCGTSQLLAIVGPHTQPLTNKLCHTDYLPLSANFA
>DILT_0001999301-mRNA-1
LEHGIQPDGQMPSDKTIGGGDDSFQTFFSETGAGKHVPRAVMVDLEPTVIGEYLCVLLTS
FILFRLISTNLGPNSQLASRTLLFAADKTTLFRLLGLLPWSLLKIAVQ
>DILT_0001999401-mRNA-1
MAENGEDANMPEEGKEGNTQDQGEHQQDVQSDEPNEADSGYSSAASSDVNSQTIPITVIL
PNREAVNLSFDPNISVSELQERLNGPGITRLNENLFFTYSGKQLDPNKTLLDYKVQKSST
LYVHETPTALPKSAPNAKEEGVVPSNCLIHSGSRMDENRCLKEYQLTQNSVIFVHRPTAN
TAVQNREEKTSSLEVTVTIRETGNQLHLPINPHXXXXTVEMHVAPGVTVGDLNRKIAIKQ
all the lines with the '>' are IDs. The following lines are the sequences regarding the ID.
I also have a file with the IDs of the sequences I want, like:
DILT_0000000001-mRNA-1
DILT_0000000101-mRNA-1
DILT_0000000201-mRNA-1
DILT_0000000301-mRNA-1
DILT_0000000401-mRNA-1
DILT_0000000501-mRNA-1
DILT_0000000601-mRNA-1
DILT_0000000701-mRNA-1
DILT_0000000801-mRNA-1
DILT_0000000901-mRNA-1
I want to write a script to match the ids and copy the sequences of this IDs, but I'm just getting the IDs, without the sequences.
seqs = open('WBPS10.protein.fa').readlines()
ids = open('ids.txt').readlines()
for line in ids:
for record in seqs:
if line == record[1:]:
print record
I don't know what to write to get the 'n' lines after the ID, because sometimes it's 2 lines, other sequences have more as you can see in my example.
The thing is, I'm trying to do it without using Biopython, which would be a lot easier. I just want to learn other ways.
Upvotes: 1
Views: 362
Reputation: 632
seqs_by_ids = {}
with open('WBPS10.protein.fa', 'r') as read_file:
for line in read_file.readlines():
if line.startswith('>'):
current_key = line[1:].strip()
seqs_by_ids[current_key] = ''
else:
seqs_by_ids[current_key] += line.strip()
ids = set([line.strip() for line in open('ids.txt').readlines()])
for id in ids:
if id in seqs_by_ids:
print(id)
print('\t{}'.format(seqs_by_ids[id]))
output:
DILT_0000000001-mRNA-1
MKVVKICSKLRKFIESRKDAVLPEQEEVLADLWAFEGISEFQMERFAKAAQCFQHQYELAIKANLTEHASRSLENLGRARARLYDYQGALDAWTKRLDYEIKGIDKAWLHHEIGRAYLELNQYEEAIDHAATARDVADREADMEWDLNATVLIAQAHFYAGNLEEAKVYFEAAQNAAFRKGFFKAESVLAEAIAEVDSEIRREEAKQERVYTKHSVLFNEFSQRAVWSEEYSEELHLFPFAVVMLRCVLARQCTVHLQFRSCYNL
DILT_0000000101-mRNA-1
MSCRRLSMNPGEALIKESSAPSRENLLKPYFDEDRCKFRHLTAEQFSDIWSHFDLDGVNELRFILRVPASQQAGTGLRFFGYISTEVYVHKTVKVSYIGFRKKNNSRALRRWNVNKKCSNAVQMCGTSQLLAIVGPHTQPLTNKLCHTDYLPLSANFA
Upvotes: 1
Reputation: 1445
This should work for you. if line == record[1:]:
statement will not work if there is some special char in string e.g \r\n . You are interested in finding the matching IDs only. Code below will work for you.
Code sample
seqs = open('WBPS10.protein.fa').readlines()
ids = open('ids.txt').readlines()
for line in ids:
for record in seqs:
if line in record :
print record
output :
>DILT_0000000001-mRNA-1
>DILT_0000000101-mRNA-1
Upvotes: 0