CelineDion

Reputation: 1088

Constantly getting IndexError and am unsure why in Python

I am new to Python, and really to programming in general, and am learning Python through rosalind.info, a website that aims to teach through problem solving.

Here is the problem, wherein you're asked to calculate the percentage of guanine and cytosine in the DNA string given for each ID, then return the ID of the sample with the greatest percentage.

I'm working on the sample problem on the page and am experiencing some difficulty. I know my code is probably really inefficient and cumbersome but I take it that's to be expected for those who are new to programming.

Anyway, here is my code.

gc = open("rosalind_gcsamp.txt","r")
biz = gc.readlines()
i = 0
gcc = 0
d = {}
for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
        del biz[i+2]

What I'm trying to accomplish here is, given input such as this:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG

Break what's given into a list based on the lines and concatenate the two lines of DNA like so:

['>Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG', 'TCCCACTAATAATTCTGAGG\n']

And then delete the entry two indices after the ID line (the line starting with >Rosalind). What I do with it later I still need to figure out.

However, I keep getting an index error and can't, for the life of me, figure out why. I'm sure it's a trivial reason, I just need some help.

I've even attempted the following to limited success:

for i in xrange(biz.__len__()):
    if biz[i].startswith(">"):
        biz[i] = biz[i].replace("\n","")
        biz[i+1] = biz[i+1].replace("\n","") + biz[i+2].replace("\n","")
    elif biz[i].startswith("A" or "C" or "G" or "T") and biz[i+1].startswith(">"):
        del biz[i]

which still gives me an index error but at least gives me the biz value I want.

Thanks in advance.

Upvotes: 1

Views: 75

Answers (2)

Padraic Cunningham

Reputation: 180411

It is very easy to do with itertools.groupby, using the lines that start with > both as the dict keys and as the delimiters between records:

from itertools import groupby
with open("rosalind_gcsamp.txt","r") as gc:
    # group consecutive lines; the key is False for ">" header lines
    # and True for the sequence lines that follow them
    groups = groupby(gc, key=lambda x: not x.startswith(">"))
    d = {}
    for k,v in groups:
        # k is False for a header line (it fails not x.startswith(">")),
        # so use that line as the dict key and call next on the groupby
        # object to get the group of sequence lines that follows it
        if not k:
            key = list(v)[0].rstrip()
            val = "".join(line.rstrip() for line in next(groups)[1])
            d[key] = val

print(d)
{'>Rosalind_0808': 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT', '>Rosalind_5959': 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC', '>Rosalind_6404': 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'}

If you need order use a collections.OrderedDict in place of d.
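Once the sequences are in a dict like d, the remaining step from the question (finding the ID with the highest GC percentage) could look roughly like the sketch below. The gc_percent helper and the two short sequences are made up for illustration and are not part of the Rosalind data.

```python
def gc_percent(seq):
    """Return the percentage of G and C characters in seq (hypothetical helper)."""
    if not seq:
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# Made-up stand-in for the dict d built above: ID -> concatenated sequence.
sequences = {
    ">sample_a": "ATAT",
    ">sample_b": "GGCA",
}

# Pick the ID whose sequence has the largest GC percentage.
best_id = max(sequences, key=lambda k: gc_percent(sequences[k]))
print(best_id)
print(gc_percent(sequences[best_id]))
```

Using max with a key function avoids keeping a running "best so far" variable by hand.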

Upvotes: 1

Klaus D.

Reputation: 14369

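You are looping over the full length of biz, so when i reaches the last indices, biz[i+1] and biz[i+2] don't exist: there is no item after the last one. On top of that, del biz[i+2] shortens the list while the xrange still counts up to the original length, so even biz[i] eventually falls out of range.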
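One way to sidestep both problems is to never look ahead by index or delete from the list while looping, and instead append each sequence line to whichever header was seen last. The following is only a sketch of that idea; parse_fasta is a hypothetical name, and the sample list stands in for the result of gc.readlines().

```python
def parse_fasta(lines):
    """Map each '>' header line to the concatenation of the lines after it."""
    records = {}
    current = None  # header that the following sequence lines belong to
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            current = line
            records[current] = ""
        elif current is not None:
            # sequence line: extend the record of the most recent header
            records[current] += line
    return records


# Stand-in for gc.readlines(), using the sample from the question.
sample = [
    ">Rosalind_6404\n",
    "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\n",
    "TCCCACTAATAATTCTGAGG\n",
]
records = parse_fasta(sample)
print(records)
```

Because the loop only ever touches the current line, it also copes with sequences that span more than two lines.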

Upvotes: 1
