Find position of string (DNA sequence) on reference with mismatches ("N")

I am trying to find start and end positions along a genome alignment for a priming region that is non-contiguous, so essentially there are 2 regions. Here's a simplified example:

genome = "GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA"
primer = "AGCTGANNNNNTCGATGC"

This would align like this:

I need to find the start and ending position on the genome where this priming region sits.

I tried doing this simply as a string by splitting where the N's are like this:

regions = primer.split('N')
start = genome.find(regions[0])
end = genome.find(regions[-1]) + len(regions[-1])

The problem with this is that on a large genome alignment, there will often be repeats of the shorter regions, so I end up with the wrong positions. As far as I can tell, there isn't anything in BioPython that does exactly this, and pairwise2 doesn't have a way to return the start and end positions.

Thank you.

Upvotes: 0

Answers (3)

Olvin Roght

Reputation: 7812

Without regular expression you can solve this task in that way:

genome = "GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA"
primer = "AGCTGANNNNNTCGATGC"

prefix, suffix = primer[:primer.find("N")], primer[primer.rfind("N") + 1:]

start_pos = end_pos = -1
while True:
    start_pos = genome.find(prefix, start_pos + 1)
    end_pos = start_pos < 0 and start_pos or genome.find(suffix, start_pos + len(prefix)) + len(suffix)
    if start_pos < 0 or end_pos - start_pos == len(primer):
       break

print(start_pos, end_pos)

Using regular expression:

import re
...

pattern = re.compile(primer.replace("N", "."))
match = pattern.search(genome)
if match:
    start_pos, end_pos = match.span()
    print(start_pos, end_pos)

To print it in format from question use next code:

print(genome)
print(" " * start_pos + "|" * len(prefix) + "." * (len(primer) - len(suffix) - len(prefix)) + "|" * len(suffix))
print("-" * start_pos + primer + "-" * (len(genome) - end_pos))

Upvotes: 1

Akshay Sehgal

Reputation: 19307

You could simply use regex for this as well.

import re

genome = "GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA"
primer = "AGCTGANNNNNTCGATGC"

#create regular expression from primer
reg = '[A-Z]+'.join([i for i in primer.split('N') if len(i)>0])

#Search for index where regular expression matches
idx = re.search(reg, genome).span()

#Output = (11,29) which is start and end of the match

#print both with primer aligned based on start to end index of match
print(genome)
print(''.join((['_']*idx[0])+list(primer)+(['_']*(len(genome)-idx[1]))))

GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA
___________AGCTGANNNNNTCGATGC__________________________________________

Regex as tested on regex101.com

Upvotes: 1

Chris

Reputation: 16182

import re

genome = "GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA"
primer = "AGCTGANNNNNTCGATGC"

split = re.split('N+', primer)
middle = len(primer)-len(''.join(split))

r = f'{split[0]}\w{{{middle}}}{split[-1]}'

re.finditer(r,genome)

# Taking index 0 here assuming there will be one match
# If there are multiple matches handle the printing by looping perhaps
pos = [(m.start(0), m.end(0)) for m in re.finditer(r, genome)][0]

print(genome)
print(' '*pos[0] + '|'*len(split[0]) + '.'*middle + '|'*len(split[-1]))
print('-'*pos[0] + primer + '-'*int(len(genome)-int(pos[0]+len(primer))))

Output

GCTAGCTAGCTAGCTGACGCGATCGATGCTAGTTTCGGTGCGCTtAAAAAAGCTTGCATGAAGGCTAGCTA
           ||||||.....|||||||
-----------AGCTGANNNNNTCGATGC------------------------------------------

Upvotes: 0

Find position of string (DNA sequence) on reference with mismatches (&quot;N&quot;)

Answers (3)

Related Questions

Find position of string (DNA sequence) on reference with mismatches ("N")