Reputation: 21

Is there a way to simplify this code?

I am doing some bioinformatics research, and I'm new to Python. I wrote this code to interpret a file containing protein sequences. The file "bulk_sequences.txt" contains 71,423 lines. Each protein sequence takes up three lines, the first of which gives information about it, including the year the protein was found (that's what the "/1945" stuff is all about). With a smaller sample of 1000 lines, it works just fine. But with this large file, it seems to take an extremely long time. Is there something I can do to simplify this?

It is meant to go through the file, sort it by year of discovery, and then assign all three lines of protein sequence data to an item within the array "sortedsqncs".

    import time
    start = time.time()



    file = open("bulk_sequences.txt", "r")
    fileread = file.read()
    bulksqncs = fileread.split("\n")
    year = 1933
    newarray = []
    years = []
    thirties = ["/1933","/1934","/1935","/1936","/1937","/1938","/1939","/1940","/1941","/1942"]## years[0]
    forties = ["/1943","/1944","/1945","/1946","/1947","/1948","/1949","/1950","/1951","/1952"]## years[1]
    fifties = ["/1953","/1954","/1955","/1956","/1957","/1958","/1959","/1960","/1961","/1962"]## years[2]
    sixties = ["/1963","/1964","/1965","/1966","/1967","/1968","/1969","/1970","/1971","/1972"]## years[3]
    seventies = ["/1973","/1974","/1975","/1976","/1977","/1978","/1979","/1980","/1981","/1982"]## years[4]
    eighties = ["/1983","/1984","/1985","/1986","/1987","/1988","/1989","/1990","/1991","/1992"]## years[5]
    nineties = ["/1993","/1994","/1995","/1996","/1997","/1998","/1999","/2000","/2001","/2002"]## years[6]
    twothsnds = ["/2003","/2004","/2005","/2006","/2007","/2008","/2009","/2010","/2011","/2012"]## years[7]

    years = [thirties,forties,fifties,sixties,seventies,eighties,nineties,twothsnds]
    count = 0
    sortedsqncs = []


    for x in range(len(years)):
        for i in range(len(years[x])):
                for y in bulksqncs:
                        if years[x][i] in y:
                            for n in range(len(bulksqncs)):
                                if y in bulksqncs[n]:
                                    sortedsqncs.append(bulksqncs[n:n+3])
                                    count +=1
    print len(sortedsqncs)

    end = time.time()
    print round((end - start),4)

Upvotes: 2

Answers (3)

chryss

Reputation: 7519

tacaswell's solution with itertools.izip_longest() is very elegant, but if you don't use the higher-level iteration tools very often, you may forget how it works, and the code may become hard for you to understand in the future.

But tacaswell is fundamentally correct that you're looping over the file far too many times. Other inefficiencies, at least from a readability and maintainability point of view, are the predefined year arrays. Also, you should pretty much never use range(len(seq)); there's nearly always a better (more Pythonic) way. Lastly, use readlines() if you want a list of lines from a file.

A more pedestrian solution would be:

1. Write a function extract_year() as suggested by tacaswell to return the year from a line of input (bulksqncs), or None if no year is found. You could use a regular expression, or if you know the position of the year in the line, use that.
2. Loop through the input and extract all sequences, assigning each to a tuple (year, three-lines-of-sequence) and adding the tuples to a list. This also allows for input files that have non-sequences interspersed with sequences.
3. Sort the list of tuples by year.
4. Extract the sequences from the sorted list of tuples.

Example code - this will give you a Python list of sorted sequences:

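# infile is an already opened file object; extract_year() is the helper from step 1
bulksqncs = infile.readlines()
sq_tuple = []
for idx, line in enumerate(bulksqncs):
    year = extract_year(line)  # call the helper once per line
    if year:
        sq_tuple.append((year, bulksqncs[idx:idx+3]))
sq_tuple.sort()
# readlines() keeps each line's trailing newline, so join without a separator
sortedsqncs = [''.join(item[1]) for item in sq_tuple]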
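
The snippet above relies on an extract_year() helper that is never shown. A minimal sketch of one follows, assuming the year appears in the header line as a slash followed by four digits (as in the "/1945" mentioned in the question). The pattern and the sample lines are assumptions made for illustration, not the real file format.

```python
import re

# Assumption: the year appears as "/YYYY" somewhere in the header line.
# Adjust the pattern to match the real file format.
YEAR_PATTERN = re.compile(r"/(\d{4})")

def extract_year(line):
    """Return the year found in a line as an int, or None if there is none."""
    match = YEAR_PATTERN.search(line)
    return int(match.group(1)) if match else None

# Hypothetical sample input, made up purely for illustration.
sample = [
    "protein B /1950\n", "MKV\n", "note b\n",
    "protein A /1945\n", "GHT\n", "note a\n",
]

records = []
for idx, line in enumerate(sample):
    year = extract_year(line)
    if year is not None:
        # keep the header line plus the two lines that follow it
        records.append((year, sample[idx:idx + 3]))

records.sort()  # tuples sort by their first element, the year
sortedsqncs = ["".join(lines) for _, lines in records]
```

Returning an int rather than the matched string keeps the sort numeric, which matters only if the years could ever differ in length.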

Upvotes: 5

Stuart

Reputation: 9868

The problem is that every time you find a year in the line, you loop through the file another time (for n in range(len(bulksqncs))). So on top of the roughly 5.7 million (= 71423 * 80) year checks, you have something like 1.7 billion (= 71423 * (71423 / 3)) inner iterations, assuming each header line contains exactly one year. You can cut this down to just the 5.7 million year checks, which will still take a bit of time but should be manageable.

A simple fix to your main loop would be to use enumerate to get the line number instead of having to loop through the entire file from the beginning again (decades here is the list called years in your code):

for decade in decades:
    for year in decade:
        for n, line in enumerate(bulksqncs):
            if year in line:
                sortedsqncs.append(bulksqncs[n:n + 3])
                count += 1

However the time can be reduced further by putting the years loop inside the loop that reads lines from the file. I would consider using a dictionary, and reading one line at a time from the file (instead of reading the whole thing in at once with read()). When you find a year in the line, you can use next to grab the next two lines as well as the one you are currently on. The programme then breaks out of the years loop, avoiding unnecessary iterations (assuming it's not possible to have more than one year in the same line).

years = ['/' + str(y) for y in range(1933, 2013)]
sequences = dict((year, []) for year in years)

with open("bulk_sequences.txt", "r") as bulk_sequences:
    for line in bulk_sequences:
        for year in years:
            if year in line:
                sequences[year].append((line,
                                        next(bulk_sequences),
                                        next(bulk_sequences)))
                break

The sorted list can then be obtained as

[sequences[year] for year in years]

Alternatively use an OrderedDict to keep the sequences in order.
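
As a rough sketch of the OrderedDict variant: the keys are created in chronological order up front, so iterating over the dictionary afterwards yields the records already sorted by year. Note that this sketch swaps the inner loop over year strings for a single regular-expression search, which is a substitution of mine and not part of the answer above. The "/YYYY" header format and the sample data are assumptions, and io.StringIO stands in for the real file.

```python
import io
import re
from collections import OrderedDict

# Hypothetical sample data standing in for bulk_sequences.txt.
sample_file = io.StringIO(
    u"protein B /1950\nMKV\nnote b\n"
    u"protein A /1945\nGHT\nnote a\n"
    u"protein C /1945\nPLS\nnote c\n"
)

# Assumption: header lines carry the year as "/YYYY".
year_pattern = re.compile(r"/(\d{4})")

# Keys are inserted in chronological order, so iteration is already sorted.
sequences = OrderedDict((year, []) for year in range(1933, 2013))

for line in sample_file:
    match = year_pattern.search(line)
    if match:
        year = int(match.group(1))
        if year in sequences:
            # next() pulls the two lines that follow the header;
            # the "" default avoids StopIteration on a truncated file.
            sequences[year].append(
                (line, next(sample_file, u""), next(sample_file, u""))
            )

# Flatten the per-year lists into one list ordered by year.
sortedsqncs = [record for records in sequences.values() for record in records]
```

Records that share a year keep the order in which they appear in the file.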

Upvotes: 3

tacaswell

Reputation: 87566

The problem is that you are looping over your giant file an absurd number of times. You can do this in one pass:

from itertools import izip_longest

#http://docs.python.org/2/library/itertools.html
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# fold your list into a list of length 3 tuples
data = [n for n in grouper(bulksqncs, 3)]
# sort the list
# tuples will 'do the right thing' by default if the line starts with the year
data.sort()

If your year line doesn't start with the year, you will need to use the key kwarg to sort

data.sort(key=lambda x: extract_year(x[0]))
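
For reference, here is a self-contained sketch of the same approach for Python 3, where izip_longest is named zip_longest. The extract_year() helper below is hypothetical: it assumes the year appears as "/YYYY" in the header line and returns 0 when no year is found, so that the sort key is always an int. The sample lines are made up, and the sketch assumes the input is strictly made of three-line groups that each start with a header.

```python
import re
from itertools import zip_longest  # named izip_longest in Python 2

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') -> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def extract_year(line):
    """Hypothetical helper: year after a slash as an int, or 0 if absent."""
    match = re.search(r"/(\d{4})", line or "")
    return int(match.group(1)) if match else 0

# Hypothetical sample input: header line, then two more lines per protein.
bulksqncs = [
    "protein B /1950", "MKV", "note b",
    "protein A /1945", "GHT", "note a",
]

# Fold the flat list into 3-tuples, then sort on the year in the header.
data = list(grouper(bulksqncs, 3))
data.sort(key=lambda record: extract_year(record[0]))
```

Because list.sort() is stable, records from the same year stay in file order.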

Upvotes: 4
