Reputation: 1
I have a list that contains various sequences of letters.
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
I want to see if the last 3 letter of each sequence in that list matches the first 3 letters of all the other sequences. If that happens, I want to know the indexes of these two sequences.
I'm basically trying to produce an adjacency list. Below is an example of an input:
>Sample_0
AAGTAAA
>Sample_1
AAATGAT
>Sample_2
AAAGTTT
>Sample_3
TTTTCCC
>Sample_4
AATTCGC
>Sample_5
CGCTCCC
And the output:
>Sample_0 >Sample_1
>Sample_0 >Sample_2
>Sample_2 >Sample_3
>Sample_4 >Sample_5
Now, I tried to make two different lists that contain all the prefixes and all the suffixes but I don't know if this can help and how to use this to solve my problem.
file = open("rosalind_grph2.txt", "r")
gene_names, sequences, = [], []
seq = ""
for line in file:
if line[0] == ">":
gene_names.append(line.strip())
if seq == "":
continue
sequences.append(seq)
seq = ""
if line[0] in "ATCG":
seq = seq + line.strip()
sequences.append(seq)
#So far I put all I needed into a list
prefix = [i[0:3] for i in sequences]
suffix = [i[len(i)-3:] for i in sequences]
#Now, all suffixes and prefixes are in lists as well
#but what now?
print(suffix)
print(prefix)
print(sequences)
file.close
Upvotes: 0
Views: 104
Reputation: 26886
If I understood correctly, what you would like to do is to connect different element of sequences
, where the connection is that beginning of the string matches the end of the other string.
One way of doing this, using a dict
is by using the following function match_head_tail()
:
def match_head_tail(items, length=3):
result = {}
for x in items:
v = [y for y in items if y[:length] == x[-length:]]
if v:
result[x] = v
return result
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
print(match_head_tail(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAAGTTT': ['TTTTCCC'], 'AATTCGC': ['CGCTCCC']}
If you want to include also sequences for which there is no match you could use the following function match_head_tail_all()
:
def match_head_tail_all( items, length=3):
return {x: [y for y in items if y[:length] == x[-length:]] for x in items}
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
print(match_head_tail_all(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAATGAT': [], 'AAAGTTT': ['TTTTCCC'], 'TTTTCCC': [], 'AATTCGC': ['CGCTCCC'], 'CGCTCCC': []}
If you actually want indexes, please combine the above with enumerate()
to get them, e.g.:
def match_head_tail_all_indexes( items, length=3):
return {
i: [j for j, y in enumerate(items) if y[:length] == x[-length:]]
for i, x in enumerate(items)}
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
print(match_head_tail_all_indexes(sequences))
# {0: [1, 2], 1: [], 2: [3], 3: [], 4: [5], 5: []}
If your input contains many sequences with the same ending, you may want to consider implementing some caching mechanism for improved computational efficiency (at the expenses of memory efficiency), e.g.:
def match_head_tail_cached(items, length=3, caching=True):
result = {}
if caching:
cached = {}
for x in items:
if caching and x[-length:] in cached:
v = cached[x[-length:]]
else:
v = [y for y in items if y[:length] == x[-length:]]
if v:
result[x] = v
return result
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
print(match_head_tail_cached(sequences))
# {'AAGTAAA': ['AAATGAT', 'AAAGTTT'], 'AAAGTTT': ['TTTTCCC'], 'AATTCGC': ['CGCTCCC']}
All this could also implemented with list
only, e.g.:
def match_head_tail_list(items, length=3):
result = []
for x in items:
v = [y for y in items if y[:length] == x[-length:]]
if v:
result.append([x, v])
return result
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
print(match_head_tail_list(sequences))
# [['AAGTAAA', ['AAATGAT', 'AAAGTTT']], ['AAAGTTT', ['TTTTCCC']], ['AATTCGC', ['CGCTCCC']]]
and even have less nesting:
def match_head_tail_flat(items, length=3):
result = []
for x in items:
for y in items:
if y[:length] == x[-length:]:
result.append([x, y])
return result
sequences = ['AAGTAAA', 'AAATGAT', 'AAAGTTT', 'TTTTCCC', 'AATTCGC', 'CGCTCCC']
print(match_head_tail_flat(sequences))
# [['AAGTAAA', 'AAATGAT'], ['AAGTAAA', 'AAAGTTT'], ['AAAGTTT', 'TTTTCCC'], ['AATTCGC', 'CGCTCCC']]
Upvotes: 0
Reputation: 315
If I am understanding your problem correctly, this code enumerates over the list twice. It is comparing the last 3 letters of first element with the first 3 letters of the second element and prints the indices of the elements if there is a match. Please give feedback/clarify if this is not what you are looking for. This is O(n^2) and can likely be sped up if you take a initial pass and store indices in a structure like a dictionary.
for index1, sequence1 in enumerate(sequences):
for index2, sequence2 in enumerate(sequences):
if index1 != index2:
if sequence1[-3:] == sequence2[0:3]:
print(sequence1[-3:], index1, sequence2[0:3], index2)
Upvotes: 1