Axelio
Axelio

Reputation: 1

Comparing one element from a list to ALL elements of another list

I have a list that contains various sequences of letters.

sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

I want to see if the last 3 letter of each sequence in that list matches the first 3 letters of all the other sequences. If that happens, I want to know the indexes of these two sequences.

I'm basically trying to produce an adjacency list. Below is an example of an input:

>Sample_0
AAGTAAA
>Sample_1
AAATGAT
>Sample_2
AAAGTTT
>Sample_3
TTTTCCC
>Sample_4
AATTCGC
>Sample_5
CGCTCCC

And the output:

>Sample_0 >Sample_1
>Sample_0 >Sample_2
>Sample_2 >Sample_3
>Sample_4 >Sample_5

Now, I tried to make two different lists that contain all the prefixes and all the suffixes but I don't know if this can help and how to use this to solve my problem.

file = open("rosalind_grph2.txt", "r")

gene_names, sequences, = [], []
seq = ""

for line in file:
    if line[0] == ">":
        gene_names.append(line.strip())
        if seq == "":
            continue
        sequences.append(seq)
        seq = ""
    if line[0] in "ATCG":
        seq = seq + line.strip()
sequences.append(seq)

#So far I put all I needed into a list

prefix = [i[0:3] for i in sequences]
suffix = [i[len(i)-3:] for i in sequences]

#Now, all suffixes and prefixes are in lists as well
#but what now?  

print(suffix)
print(prefix)
print(sequences)
file.close

Upvotes: 0

Views: 104

Answers (2)

norok2
norok2

Reputation: 26886

If I understood correctly, what you would like to do is to connect different element of sequences, where the connection is that beginning of the string matches the end of the other string.

One way of doing this, using a dict is by using the following function match_head_tail():

def match_head_tail(items, length=3):
    result = {}
    for x in items:
        v = [y for y in items if y[:length] == x[-length:]]
        if v:
            result[x] = v
    return result
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAAGTTT': ['TTTTCCC'], 'AATTCGC': ['CGCTCCC']}

If you want to include also sequences for which there is no match you could use the following function match_head_tail_all():

def match_head_tail_all( items, length=3):
    return {x: [y for y in items if y[:length] == x[-length:]] for x in items}
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_all(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAATGAT': [], 'AAAGTTT': ['TTTTCCC'], 'TTTTCCC': [], 'AATTCGC': ['CGCTCCC'], 'CGCTCCC': []}

EDIT 1

If you actually want indexes, please combine the above with enumerate() to get them, e.g.:

def match_head_tail_all_indexes( items, length=3):
    return {
        i: [j for j, y in enumerate(items) if y[:length] == x[-length:]]
        for i, x in enumerate(items)}


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_all_indexes(sequences))
# {0: [1, 2], 1: [], 2: [3], 3: [], 4: [5], 5: []}

EDIT 2

If your input contains many sequences with the same ending, you may want to consider implementing some caching mechanism for improved computational efficiency (at the expenses of memory efficiency), e.g.:

def match_head_tail_cached(items, length=3, caching=True):
    result = {}
    if caching:
        cached = {}
    for x in items:
        if caching and x[-length:] in cached:
            v = cached[x[-length:]]
        else:
            v = [y for y in items if y[:length] == x[-length:]]    
        if v:
            result[x] = v
    return result


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_cached(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAAGTTT': ['TTTTCCC'], 'AATTCGC': ['CGCTCCC']}

EDIT 3

All this could also implemented with list only, e.g.:

def match_head_tail_list(items, length=3):
    result = []
    for x in items:
        v = [y for y in items if y[:length] == x[-length:]]
        if v:
            result.append([x, v])
    return result


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_list(sequences))
# [['AAGTAAA', ['AAATGAT', 'AAAGTTT']], ['AAAGTTT', ['TTTTCCC']], ['AATTCGC', ['CGCTCCC']]]

and even have less nesting:

def match_head_tail_flat(items, length=3):
    result = []
    for x in items:
        for y in items:
            if y[:length] == x[-length:]:
                result.append([x, y])
    return result


sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']

print(match_head_tail_flat(sequences))
# [['AAGTAAA', 'AAATGAT'], ['AAGTAAA', 'AAAGTTT'], ['AAAGTTT', 'TTTTCCC'], ['AATTCGC', 'CGCTCCC']]

Upvotes: 0

krmckone
krmckone

Reputation: 315

If I am understanding your problem correctly, this code enumerates over the list twice. It is comparing the last 3 letters of first element with the first 3 letters of the second element and prints the indices of the elements if there is a match. Please give feedback/clarify if this is not what you are looking for. This is O(n^2) and can likely be sped up if you take a initial pass and store indices in a structure like a dictionary.


for index1, sequence1 in enumerate(sequences):
    for index2, sequence2 in enumerate(sequences):
        if index1 != index2:
            if sequence1[-3:] == sequence2[0:3]:
                print(sequence1[-3:], index1, sequence2[0:3], index2)

Upvotes: 1

Related Questions