Nicholas Bock

Reputation: 79

Befuddling Python string index out of range error

I looked through the other questions on this topic, but couldn't find something that really addresses what I'm trying to figure out.

The problem is this: I'm trying to create a program that looks for palindromes in two complementary strands of DNA, returning the position and length of each palindrome identified.

For instance, if given the sequence TTGATATCTT, the program should find the complement (AACTATAGAA), and then identify the second index as being the start of a 6-character palindrome.

I'm brand new to programming, so it might look totally goofy, but the code I came up with looks like this:

'''This first part imports the sequence (usually consisting of multiple lines of text)
from a file. I have a feeling there's an easier way of doing this, but I just don't
know what that would be.'''

length = 4
li = []
for line in open("C:/Python33/Stuff/Rosalind/rosalind_revp.txt"):
    if line[0] != ">":
        li.append(line)
seq = (''.join(li))

'''The complement() function takes the starting sequence and creates its complement'''

def complement(seq):
    li = []
    t = int(len(seq))
    for i in range(0, t):
        n = (seq[i])
        if n == "A":
            li.append(n.replace("A", "T"))        
        if n == "T":
            li.append(n.replace("T", "A"))
        if n == "C":
            li.append(n.replace("C", "G"))
        if n == "G":
            li.append(n.replace("G", "C"))
    answer = (''.join(li))
    return(answer)

'''the ip() function goes letter by letter, testing to see if it matches with the letter
x spaces in front of it on the complementary strand(x being specified by length). If the
letter doesn't match, it continues to the next one. After checking all possibilities for
one length, the function runs again with the argument length+1.'''

def ip(length, seq):
    n = 0
    comp = complement(seq)
    while length + n <= (len(seq)):
        for i in range(0, length-1):
            if seq[n + i] != comp[n + length - 1 - i]:
                n += 1
                break
            if (n + i) > (n + length - 1 - i):
                print(n + 1, length)
                n += 1
    if length <= 12:
        ip(length + 1, seq)

ip(length, seq)

The thing runs absolutely perfectly when starting with short sequences (TCAATGCATGCGGGTCTATATGCAT, for example), but with longer sequences, I invariably get this error message:

Traceback (most recent call last):
  File "C:/Python33/Stuff/Ongoing/palindrome.py", line 48, in <module>
    ip(length, seq)
  File "C:/Python33/Stuff/Ongoing/palindrome.py", line 39, in ip
    if seq[n + i] != comp[n + length - 1 - i]:
IndexError: string index out of range

The message is given after the program finishes checking the possible 4-character palindromes, before starting the function for length + 1.

I understand what the message is saying, but I don't understand why I'm getting it. Why would this work for some strings and not others? I've been checking for the past hour to see if it makes a difference whether the sequence has an odd number of characters or an even number of characters, is a multiple of 4, is just shy of a multiple of 4, etc. I'm stumped. What am I missing?

Any help would be appreciated.

P.S. The problem comes from the Rosalind Website (Rosalind.info), which uses 1-based numbering. Hence the print(n+1, length) at the end.

Upvotes: 2

Views: 2084

Answers (1)

martineau

Reputation: 123463

The IndexError can be avoided by changing the last line of:

if line[0] != ">":
    li.append(line)

to

if line[0] != ">":
    li.append(line.rstrip())

near the beginning of your code. This prevents any trailing whitespace, especially newlines, read from the file from becoming part of the seq string. Having them in it is a problem because the complement() function ignores and thus removes them, so the answer string it returns isn't necessarily the same length as the input argument. As a result, comp and seq can end up with different lengths inside the ip() function, and an index that is valid for seq can be out of range for comp.
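
To make the length mismatch concrete, here is a minimal sketch. It assumes a made-up two-line input rather than your actual file: raw_lines and complement_dropping_unknowns() are hypothetical names, and the helper only mimics how the complement() in the question skips anything that is not A, T, C or G.

```python
# Hypothetical stand-in for two lines read from the file (note the "\n").
raw_lines = ["TTGAT\n", "ATCTT\n"]

COMPLEMENT = str.maketrans("ATCG", "TAGC")


def complement_dropping_unknowns(seq):
    """Mimic the question's complement(): non-ATCG characters are skipped."""
    return "".join(ch.translate(COMPLEMENT) for ch in seq if ch in "ATCG")


seq_raw = "".join(raw_lines)                               # newlines kept
seq_clean = "".join(line.rstrip() for line in raw_lines)   # newlines stripped

comp_raw = complement_dropping_unknowns(seq_raw)
comp_clean = complement_dropping_unknowns(seq_clean)

print(len(seq_raw), len(comp_raw))      # 12 10
print(len(seq_clean), len(comp_clean))  # 10 10

# The last valid index of seq_raw is already past the end of comp_raw.
try:
    comp_raw[len(seq_raw) - 1]
except IndexError as exc:
    print("IndexError:", exc)
```

With the newlines left in, seq is two characters longer than comp, so any loop bounded by len(seq) can eventually index past the end of comp.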

You didn't ask, but here's how I'd shorten your code and make it more "Pythonic":

COMPLEMENT = str.maketrans("ATCG", "TAGC")
LENGTH = 4

with open("palindrome.txt") as infile:  # avoid shadowing the built-in input()
    seq = ''.join(line.rstrip() for line in infile if line[0] != ">")

def complement(seq): return seq.translate(COMPLEMENT)

def ip(length, seq):
    n = 0
    comp = complement(seq)
    while length + n <= len(seq):
        for i in range(0, length-1):
            if seq[n + i] != comp[n + length - 1 - i]:
                n += 1
                break
            if n + i > n + length - 1 - i:
                print(n + 1, length)
                n += 1
    if length <= 12:
        ip(length + 1, seq)

print(repr(seq))
print(repr(complement(seq)))
ip(LENGTH, seq)

BTW, those two print() function calls added near the end are what gave me the clue about what was wrong.
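
If it helps, the repr() trick can be tried in isolation. This is only a sketch on a made-up string (the seq value below is an assumption, not your data); it shows why repr() reveals what a plain print() hides, plus one possible programmatic check for stray characters.

```python
# Hypothetical sequence text as it might arrive from a file, newlines included.
seq = "TTGAT\nATCTT\n"

print(seq)        # looks like two ordinary lines; the newlines are invisible
print(repr(seq))  # 'TTGAT\nATCTT\n' (the escape sequences are spelled out)

# A quick check for characters outside the DNA alphabet.
unexpected = sorted(set(seq) - set("ATCG"))
print(unexpected)  # ['\n']
```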

Upvotes: 3
