Reputation: 199
I am new to Python and would like to know whether what I am attempting is possible. I have a section of a DNA alignment, and for each gap ("-") in the bottom sequence I would like to identify the nucleotide at the same position in the top sequence. For the example shown here, I would expect to get "G".
My efforts so far have not been successful. The alignment is:
ATTCAGGCCTAGCA
:::::  :: ::::
ATTCAA-CCAAGCA
I appreciate any assistance!
Upvotes: 1
Views: 369
Reputation: 11
I would recommend the Biopython library. It has many data types designed for manipulating DNA, RNA and protein sequences (alignments, trees, etc.). In this case, AlignIO from the Biopython package should help.
from Bio import AlignIO
# read the aligned sequences (gaps included) from a FASTA file
alignment = AlignIO.read("my_seq.fa", "fasta")
# all rows have the same length, so the length of the first row is the number of columns
cols = len(alignment[0])
# rows and columns are indexed like a NumPy array: alignment[row, col]
for col in range(cols):
    if alignment[1, col] == "-":
        # gap in the second row: report the nucleotide above it in the first row
        print("gap!", alignment[0, col])
Upvotes: 1
Reputation: 2083
above = 'ATTCAGGCCTAGCA'
below = 'ATTCAA-CCAAGCA'
gap_letters = [above[i] for i, j in enumerate(below) if j == '-']
Upvotes: 1
Reputation: 396
I'm not sure how your data is stored. Let's say it's two equal-length strings in a tuple:
dna_pair = ('ATTCAGGCCTAGCA','ATTCAA-CCAAGCA')
Then you could try:
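def find_align(dna_pair):
    # walk both sequences position by position
    for i in range(len(dna_pair[0])):
        if dna_pair[1][i] == '-':
            # returns the top nucleotide at the first gap only (None if there is no gap)
            return dna_pair[0][i]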
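Note that find_align stops at the first gap. If the alignment may contain several gaps, a variant along these lines could collect all of them. This is only a sketch: the name find_gap_nucleotides and the (index, nucleotide) return format are my own assumptions, not something from the question.

```python
def find_gap_nucleotides(dna_pair):
    """Return (index, nucleotide) for every gap in the second sequence.

    Hypothetical helper: assumes dna_pair is a tuple of two aligned,
    equal-length strings, with '-' marking gaps in the second one.
    """
    top, bottom = dna_pair
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    # keep the position as well, so each gap can be located afterwards
    return [(i, top[i]) for i, ch in enumerate(bottom) if ch == '-']


dna_pair = ('ATTCAGGCCTAGCA', 'ATTCAA-CCAAGCA')
print(find_gap_nucleotides(dna_pair))  # [(6, 'G')]
```

The index is zero-based, so 6 refers to the seventh column of the alignment.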
Upvotes: 1
Reputation: 949
As I don't have any information about the data format, I will describe the general process. Create two lists, one from the first line and one from the last line (which I assume are aligned and have the same length), and iterate over them in parallel. At each position, check whether the character in the last line's list is a '-' and, if so, print the character at the same position in the first line's list.
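The process described here could be sketched roughly as follows, assuming the alignment is available as a three-line block of text like the one in the question. The variable alignment_text and the function name nucleotides_above_gaps are made up for illustration.

```python
# Assumed input format: top sequence, match line, bottom sequence.
alignment_text = """ATTCAGGCCTAGCA
:::::  :: ::::
ATTCAA-CCAAGCA"""


def nucleotides_above_gaps(text):
    """Collect the top-line characters that sit above a '-' in the last line."""
    lines = text.strip().splitlines()
    # first and last line as lists of single characters
    top, bottom = list(lines[0]), list(lines[-1])
    if len(top) != len(bottom):
        raise ValueError("first and last line must have the same length")
    found = []
    # iterate over both lists in parallel
    for upper, lower in zip(top, bottom):
        if lower == '-':
            found.append(upper)
    return found


print(nucleotides_above_gaps(alignment_text))  # ['G']
```

The match line in the middle is ignored, so it does not matter whether its spacing survived copying.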
Upvotes: 1