Reputation: 839
I have two files. The first one looks like this (only part of it is shown):
>UniRef90_A0A0K2VG56 - Cluster: titin
MTTQAPTFTQPLQSVVALEGSAATFEAHVSGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRARLMIPAVTKANSGQYSLRATNGSGQATSTAELLVTAETAPPNFTQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIAEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISETRQTRIEKKIEQKIEAHFDAKSIAT
VEMVIDGATGQQLPHKTPPRIPPKPKSRSPTPPSVAAKAQLGRQQSPSPIRHSPSPVRHV
>UniRef90_UPI00045E3C3E - Cluster: titin isoform X25
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWIRDGQVISTSTLPGVQISFSD
GRAKLTIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIAEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEEVPAKKTKTIVSTAQISESRQTRIEKKIEAHFDARSIATVEMV
IDGAAGQQLPHKTPPRIPPKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT
The second one has several lines, each made up only of Uniref90_XXXXXXX IDs:
UniRef90_A0A0K2VG56 UniRef90_A0A0P5UY87 UniRef90_A0A0V0H4B3 UniRef90_A0A132GS96
UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0
What I want is a list of the sequences (the letters ...RKMQAATAATG...) corresponding to the different Uniref90_XXXXXXX IDs.
I mean, for the first line of my second file, I should get a list of the sequences for its 4 Uniref90_XXXXXXX IDs. I do not want to keep the Uniref90_XXXXXXX IDs from the second file, only the sequences.
A short example of what I need:
UniRef90_A0A0K2VG56 UniRef90_A0A0P5UY87
should give me:
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWIRDGQVISTSTLPGVQISFSD
GRAKLTIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIAEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEEVPAKKTKTIVSTAQISESRQTRIE ###UniRef90_A0A0K2VG56
VEMVIDGATGQQLPHKTPPRIPPKPKSRSPTPPSVAAKAQLGRQQSPSPIRHSPSPVRHV
RAPTPSPVRSVSPAGRISTSPIRSVKSPLLTRKMQAATAATGSEVPPPWKQESYMASSAE
AEMRETTMTSSTQIRREERWEGRYGVQE ###Uniref90_A0A0P5UY87
Is it possible in Python to do this?
Edit:
So far, I have tried to create a dictionary with the Uniref90_XXXXX IDs as keys and the corresponding sequences as values.
f2=open("~/PROJET_M2/data/uniref90.fasta", "r")
fasta={}
for i in f2:
    i=i.rstrip("\n")
    if i.startswith(">"):
        l=next(f2,'').strip() ### the problem is there I guess
        i=i[1:]
        i=i.split(" ")
        fasta[i[0]]=l
print(fasta)
It does not work: the keys are created correctly, but as you can see in the first file, each sequence spans several lines. This code only stores the first line after the Uniref90_XXXXXXX ID, not all of them.
Upvotes: 1
Views: 52
Reputation: 10779
I have this little function to deal with FASTA sequences. It reads a file and outputs a dict of sequences. It also handles empty lines and sequences spanning multiple lines.
def parse_fasta(fasta_file):
    '''file_path => dict
    Return a dict of id:sequence pairs.
    '''
    d = {}
    _id = None
    seq = ''
    with open(fasta_file, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines
            if line.startswith('>'):
                if _id is not None:
                    # store the previous record before starting a new one
                    d[_id] = seq
                _id = line.strip()[1:]
                seq = ''
            else:
                # sequence line: append it to the current record
                seq += line.strip()
    if _id is not None:
        d[_id] = seq  # store the last record
    return d
You just need to tweak _id = line.strip()[1:] to discard the part of the ID line you don't need. I guess _id = line.strip()[1:].split()[0] would be enough.
Upvotes: 1
Reputation: 6643
You could build the dictionary like this, using a simple buffer (current here):
with open("/path/to/file", "r") as f1:
    result, current_id, current = {}, None, ""
    for l in f1:
        if l[0] == ">":
            # new header: store the record collected so far
            if current_id:
                result[current_id] = current
            current_id = l[1:].strip()
            current = ""
        else:
            # sequence line: append it to the buffer
            current += l.strip()
    # store the last record
    if current_id:
        result[current_id] = current
About the with keyword: https://www.pythonforbeginners.com/files/with-statement-in-python
I assume the rest is ok for you?
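In case the rest is not obvious, here is a rough sketch of the lookup step. It assumes the dict keys are the bare IDs (for example by keeping only l[1:].split()[0] as the key) and that the IDs in the second file are spelled exactly like the FASTA headers. The function name sequences_per_line and the toy data are made up for illustration, and the sequences are shortened.

```python
def sequences_per_line(id_lines, fasta):
    """Return one list of sequences per line of whitespace-separated IDs."""
    groups = []
    for line in id_lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        # Keep only the sequences; IDs missing from the dict are skipped.
        groups.append([fasta[i] for i in ids if i in fasta])
    return groups


# Toy stand-ins for the two files (hypothetical, shortened sequences).
fasta = {
    "UniRef90_A0A0K2VG56": "MTTQAPTFTQ",
    "UniRef90_UPI00045E3C3E": "MTTQAPTFTQPLQ",
}
id_lines = [
    "UniRef90_A0A0K2VG56 UniRef90_UPI00045E3C3E\n",
    "UniRef90_A0A0P5UY87\n",  # not in the dict, so this line gives an empty list
]

groups = sequences_per_line(id_lines, fasta)
print(groups)
```

An open file object can be passed as id_lines, since iterating over it yields one line at a time. Note that the lookup is case-sensitive, so Uniref90_... and UniRef90_... would be treated as different keys.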
Upvotes: 1