user1998510
user1998510

Reputation: 75

Checking for specific amino acids in specific positions in a multiple sequence alignment

There is a similar question on Stack Overflow, but it is using the Linux terminal (Search for specific characters in specific positions of line). I would like to do a similar thing using python, and I cant quite work out what is the pythonic way to do this without having to manually write for the membership checks.


I would like to search for specific amino acids at specific positions of a multiple sequence alignment. I have defined the positions of the amino acid alignment in a list of indexes ,

e.g Index = [1, 100, 235, 500]. 

I have defined the amino acids I want in those positions also.

Res1 = ["A","G"]
Res2 = ["T","F"]
Res3 = ["S,"W"]
Res4 = ["H","J"]

I am currently doing something like this :

for m in records_dict:
    if (records_dict[m].seq[Index[0]] \
        in Res1) and (records_dict[m].seq[Index[1]] \
        in Res2) and (records_dict[m].seq[Index[2]] \
        in Res3) and (records_dict[m].seq[Index[3]]\
        in Res4)
    print m

Now, suppose I have a list of 40 residues I want to check, I know I have to write the lists of residues to check manually, but surely, there is an easier way to do this membership check using a while loop or something else.

Also, is there any way I could incorporate a system where if no sequences match all the 40 membership checks, I would get the 5 best sequences that are closest to matching all 40 checks, and an output such as sequence "m" has 30/40 matches and a list of what those 30 matches are, and which 10 didn't match?

Upvotes: 0

Views: 902

Answers (1)

NiziL
NiziL

Reputation: 5140

I will assume you want to check if Res1 is at Index[0], Res2 at Index[1] and so on.

res = [Res1, Res2, Res3, Res4]
for m in records_dist:
    match = 0
    match_log = []
    for i in Index:
        if records_dict[m].seq[i] in res[i]:
            match += 1
            match_log.append(i)    

With this little code, you're able to compute the number of match, and track the indexes where match occurs for each records_dist values.

If you want to check if ResX is at multiple position, or if you don't want to order the indexes list like the Res list, I'd define a dictionary of list where keys are ResX and values are list of indexes:

to_check = {}
to_check[Res1] = [index1, index2]
to_check[Res2] = [index1, ..., indexN]
...
to_check[ResX] = [indexI, ..., indexJ]

then, using

match_log = {}
for m in records_dist:
    match_log[m] = {}
    for res, indexes in to_check:
        match_log[m][res] = []
        for i in indexes:
            if records_dict[m].seq[i] in res:
                match_log[m][res].append(i)
        nb_match = len(match_log[m][res])

or in a more pythonic way, using filter:

match_log = {}
for m in records_dist:
    match_log[m] = {}
    for res, indexes in to_check:
        match_log[m][res] = filter(lamba i: records_dict[m].seq[i] in res, indexes)
        nb_match = len(match_log[m][res])

Upvotes: 1

Related Questions