user1647556
user1647556

Reputation:

Aligning matching sequences

I am trying to align matching sequences based on the first 5 and last 5 characters. So, for example:

AAATGCEGAIRPVOGJKKK
KKKTGKAFKEJWKLJFFFF
FFFKEJFWKFJWEKFJIGK

Would align and concatenate to create:

AAATGCEGAIRPVOGJKKKTGKAFKEJWKLJFFFFKEJFWKFJWEKFJIGK 

Note that the mapping regions should not be repeated.I actually have more than 3 lines, and they aren't ordered so I am trying to create a loop to align all of them together. I'm not sure the best way to approach this.

Upvotes: 1

Views: 470

Answers (1)

Blckknght
Blckknght

Reputation: 104712

I think you can solve this by making a dictionary of prefixes. Once you have it, start with an arbitrary sequence and find its suffix is in the prefix list. Then move on to the sequence that had that prefix, building a chain.

Here's some code:

def make_circular_overlapping_sequence(sequences, min_overlap=1, max_overlap=5):
    # start by mapping prefixes to full sequences
    prefixes = {}
    for seq in sequences:
        for length in range(min_overlap, max_overlap+1):
            prefixes[seq[:length]] = seq

    # pick arbitrary a start sequence
    start = current = sequences[0]

    # build a chain of sequences with overlapping suffixes and prefixes
    chain = [start]
    while True:
        # try longest suffixes first
        for length in range(max_overlap, min_overlap-1, -1):
            suffix = current[-1-length:]
            if suffix in prefixes:
                current = prefixes[suffix]
                if current == start: # looped around, so we're done
                    return "".join(chain)
                chain.append(current[length+1:]) # don't duplicate the prefix
                break
        else: # for loop ended without breaking
            raise ValueError("No match found for sequence {!r}"
                             .format(current))

Test output:

>>> sequences = '''AAATGCEGAIRPVOGJKKK
KKKTGKAFKEJWKLJFFFF
FFFKEJFWKFJWEKFJIGK
GKXYZ1234AAAT'''.split()
>>> make_circular_overlapping_sequence(sequences)
'AAATGCEGAIRPVOGJKKKTGKAFKEJWKLJFFFFKEJFWKFJWEKFJIGKXYZ1234AAAT'

Notes:

  1. This code works for unambiguous circular sequences, and may not work as desired if there are ambiguous matches (e.g. "...ABC" which could be followed by either "ABC..." or "BC..."). Currently it always picks the longest overlap.
  2. The code raises an exception if the chain ends without looping, but it may run forever if there's a short loop that does not include the start element (e.g. ["A...B", "B..C", "C..B"]).
  3. The algorithm doesn't guarantee that all of the input sequences are included in the output. It just finds a loop and stops.
  4. The matching parts of the first and last values are not trimmed off. If you don't want that, I suggest slicing the final suffix out of the return value: return "".join(chain)[:-length]

Upvotes: 1

Related Questions