Paillou

Reputation: 839

How to match two files according to variables?

I have two files. The first one looks like this (I only show part of it):

>UniRef90_A0A0K2VG56 - Cluster: titin
MTTQAPTFTQPLQSVVALEGSAATFEAHVSGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRARLMIPAVTKANSGQYSLRATNGSGQATSTAELLVTAETAPPNFTQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIAEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISETRQTRIEKKIEQKIEAHFDAKSIAT
VEMVIDGATGQQLPHKTPPRIPPKPKSRSPTPPSVAAKAQLGRQQSPSPIRHSPSPVRHV
>UniRef90_UPI00045E3C3E - Cluster: titin isoform X25
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWIRDGQVISTSTLPGVQISFSD
GRAKLTIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIAEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEEVPAKKTKTIVSTAQISESRQTRIEKKIEAHFDARSIATVEMV
IDGAAGQQLPHKTPPRIPPKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT

The second one has several lines, each made up only of UniRef90_XXXXXXX identifiers:

UniRef90_A0A0K2VG56 UniRef90_A0A0P5UY87 UniRef90_A0A0V0H4B3 UniRef90_A0A132GS96
UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0

What I want is a list of the sequences (the letters ...RKMQAATAATG...) corresponding to the different UniRef90_XXXXXXX identifiers.

I mean, for the first line of my second file, I should get a list of the sequences for its 4 UniRef90_XXXXXXX identifiers. I do not want to keep the "UniRef90_XXXXXXX" identifiers from the second file, only the sequences.

A short example of what I need:

UniRef90_A0A0K2VG56 UniRef90_A0A0P5UY87

should give me:

MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWIRDGQVISTSTLPGVQISFSD
GRAKLTIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIAEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEEVPAKKTKTIVSTAQISESRQTRIE  ###UniRef90_A0A0K2VG56
VEMVIDGATGQQLPHKTPPRIPPKPKSRSPTPPSVAAKAQLGRQQSPSPIRHSPSPVRHV
RAPTPSPVRSVSPAGRISTSPIRSVKSPLLTRKMQAATAATGSEVPPPWKQESYMASSAE
AEMRETTMTSSTQIRREERWEGRYGVQE ###Uniref90_A0A0P5UY87

Is it possible in Python to do this?

Edit:

So far, I have tried to create a dictionary with the UniRef90_XXXXX ids as keys and the corresponding sequences as values.

f2=open("~/PROJET_M2/data/uniref90.fasta", "r")

fasta={}

for i in f2:
        i=i.rstrip("\n")
        if i.startswith(">"):
                l=next(f2,'').strip()   ### the problem is there I guess
                i=i[1:]
                i=i.split(" ")
                fasta[i[0]]=l
                print(fasta)

It does not work: the keys are created correctly, but as you can see in the first file, each sequence spans several lines. This code only adds the first line after the UniRef90_XXXXXXX id, not all of them.

Upvotes: 1

Views: 52

Answers (2)

alec_djinn

Reputation: 10779

I have this little function to deal with FASTA sequences. It reads a file and outputs a dict of sequences. It also handles empty lines and sequences spanning multiple lines.

def parse_fasta(fasta_file):
    '''file_path => dict
    Return a dict of id:sequence pairs.
    '''
    d = {}
    _id = None
    seq = ''
    with open(fasta_file, 'r') as f:
        for line in f:
            if line.startswith('\n'):
                continue  # skip empty lines
            if line.startswith('>'):
                # New header: store the record collected so far, if any.
                if _id is not None:
                    d[_id] = seq
                _id = line.strip()[1:]
                seq = ''
            else:
                # Sequence line: append it to the current record.
                seq += line.strip()
    # Store the last record (nothing to store if the file had no header).
    if _id is not None:
        d[_id] = seq
    return d

You just need to tweak _id = line.strip()[1:] to discard the part of the id line you don't need. I guess _id = line.strip()[1:].split()[0] would be enough.
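
To see what that tweak does, here is a small check on a header line copied from the question. The names header and seq_id are made up for this illustration only.

```python
# Header line copied from the first file in the question.
header = ">UniRef90_A0A0K2VG56 - Cluster: titin\n"

# Drop the newline and the leading '>', then keep only the first
# whitespace-separated token, which is the UniRef90 identifier.
seq_id = header.strip()[1:].split()[0]
print(seq_id)  # UniRef90_A0A0K2VG56
```

With that change the dict keys should line up with the identifiers listed in the second file, provided the spelling and case are identical in both files.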

Upvotes: 1

olinox14

Reputation: 6643

You could build the dictionary like this, using a simple buffer (current here):

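with open("/path/to/file", "r") as f1:
    result, current_id, current = {}, None, ""
    for l in f1:
        if l[0] == ">":
            # New header: store the previous record before starting a new one.
            if current_id:
                result[current_id] = current
            # Keep only the first word of the header (the UniRef90 id),
            # so the keys match the ids listed in the second file.
            current_id = l[1:].split()[0]
            current = ""
        else:
            current += l.strip()
    # Store the last record, if the file contained any header at all.
    if current_id:
        result[current_id] = current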

About the with keyword: https://www.pythonforbeginners.com/files/with-statement-in-python

I assume the rest is ok for you?
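
In case it is not, the remaining step (turning each line of the second file into a list of sequences) could be sketched roughly as follows. This assumes the dictionary keys are the bare identifiers, i.e. only the first word of each header line. The function name sequences_per_line and the toy data are hypothetical stand-ins, not your real files.

```python
def sequences_per_line(id_lines, sequences):
    """For each line of whitespace-separated ids, return the matching sequences."""
    matched = []
    for line in id_lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        # Ids missing from the dict are silently skipped here (an assumption);
        # raising an error instead would be another reasonable choice.
        matched.append([sequences[i] for i in ids if i in sequences])
    return matched


# Toy data standing in for the parsed FASTA dict and the second file.
result = {"UniRef90_A": "MTTQ", "UniRef90_B": "GRAK"}
id_lines = ["UniRef90_A UniRef90_B\n", "UniRef90_B UniRef90_MISSING\n"]

print(sequences_per_line(id_lines, result))
# [['MTTQ', 'GRAK'], ['GRAK']]
```

With real data, id_lines can simply be the open file object of the second file. Dictionary lookups are case-sensitive, so an id written Uniref90_... will not match a key written UniRef90_...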

Upvotes: 1
