Reputation:
I am trying to align matching sequences based on the first 5 and last 5 characters. So, for example:
AAATGCEGAIRPVOGJKKK
KKKTGKAFKEJWKLJFFFF
FFFKEJFWKFJWEKFJIGK
Would align and concatenate to create:
AAATGCEGAIRPVOGJKKKTGKAFKEJWKLJFFFFKEJFWKFJWEKFJIGK
Note that the mapping regions should not be repeated.I actually have more than 3 lines, and they aren't ordered so I am trying to create a loop to align all of them together. I'm not sure the best way to approach this.
Upvotes: 1
Views: 470
Reputation: 104712
I think you can solve this by making a dictionary of prefixes. Once you have it, start with an arbitrary sequence and find its suffix is in the prefix list. Then move on to the sequence that had that prefix, building a chain.
Here's some code:
def make_circular_overlapping_sequence(sequences, min_overlap=1, max_overlap=5):
# start by mapping prefixes to full sequences
prefixes = {}
for seq in sequences:
for length in range(min_overlap, max_overlap+1):
prefixes[seq[:length]] = seq
# pick arbitrary a start sequence
start = current = sequences[0]
# build a chain of sequences with overlapping suffixes and prefixes
chain = [start]
while True:
# try longest suffixes first
for length in range(max_overlap, min_overlap-1, -1):
suffix = current[-1-length:]
if suffix in prefixes:
current = prefixes[suffix]
if current == start: # looped around, so we're done
return "".join(chain)
chain.append(current[length+1:]) # don't duplicate the prefix
break
else: # for loop ended without breaking
raise ValueError("No match found for sequence {!r}"
.format(current))
Test output:
>>> sequences = '''AAATGCEGAIRPVOGJKKK
KKKTGKAFKEJWKLJFFFF
FFFKEJFWKFJWEKFJIGK
GKXYZ1234AAAT'''.split()
>>> make_circular_overlapping_sequence(sequences)
'AAATGCEGAIRPVOGJKKKTGKAFKEJWKLJFFFFKEJFWKFJWEKFJIGKXYZ1234AAAT'
Notes:
"...ABC"
which could be followed by either "ABC..."
or "BC..."
). Currently it always picks the longest overlap.["A...B", "B..C", "C..B"]
).return "".join(chain)[:-length]
Upvotes: 1