Reputation: 449
I have the following program that has been running for about two hours and has maybe 1/4 more to go. My questions are below the code:
import csv

input_csv = "LOCATION_ID.csv"
input2 = "CITIES.csv"
output_csv = "OUTPUT_CITIES.csv"

with open(input_csv, "rb") as infile:
    input_fields = ("ID", "CITY_DECODED", "CITY", "STATE", "COUNTRY", "SPELL1", "SPELL2", "SPELL3")
    reader = csv.DictReader(infile, fieldnames=input_fields)

    with open(input2, "rb") as infile2:
        input_fields2 = ("Latitude", "Longitude", "City")
        reader2 = csv.DictReader(infile2, fieldnames=input_fields2)
        next(reader2)
        words = []
        for next_row in reader2:
            words.append(next_row["City"])

    with open(output_csv, "wb") as outfile:
        output_fields = ("EXISTS", "ID", "CITY_DECODED", "CITY", "STATE", "COUNTRY", "SPELL1", "SPELL2", "SPELL3")
        writer = csv.DictWriter(outfile, fieldnames=output_fields)
        writer.writerow(dict((h, h) for h in output_fields))
        next(reader)
        for next_row in reader:
            search_term = next_row["CITY_DECODED"]
            # I think the problem is here where I run through every city
            # in "words", even though all I want to know is if the city
            # in "search_term" exists in "words"
            for item in words:
                if search_term in words:
                    next_row["EXISTS"] = 1
            writer.writerow(next_row)
I have a few questions here:

1. Given that input_csv has 14k rows and input2 has only 6k rows, why does this take so long? I understand that the inner-most for loop (beginning for item in words:) is inefficient (see question 3), but I'm hoping for a more intuitive picture of what is happening behind the scenes, so that I (and hopefully other SO users) can avoid making this same mistake in other programs.

2. If I want this code to continue to run, how does that relate to my leaving the computer and it going to sleep/hibernating? Will the program stop at that point but start up again on its own when the computer is in use again? I'm really wondering how the interpreter, once it is running a program, interacts with the operating system, and what it means for a computer to "go to sleep" as far as a Python program is concerned.

3. What is a more efficient implementation of this code? I'm not wrong in thinking it shouldn't take more than a few minutes to do this, right?
Thanks very much!
Upvotes: 0
Views: 68
Reputation: 1682
Let's start with one spot of inefficiency I see:
for next_row in reader:
    search_term = next_row["CITY_DECODED"]
    for item in words:
        if search_term in words:
            next_row["EXISTS"] = 1
That's 14k iterations of the outer for loop, and then roughly 6k iterations of the nested for loop each time. On top of that, every if search_term in words check iterates over words yet again until it finds a match (or exhausts the list).
I haven't put too much thought into what this algorithm is actually doing, but you should at the very least remove duplicates in words (i.e. words = list(set(words))) — or better, keep words as a set, since an in check on a set is a constant-time hash lookup rather than a linear scan.
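As an illustrative sketch (the city names here are made up, standing in for the question's words list), converting to a set both removes duplicates and turns each membership test into a hash lookup:

```python
# Hypothetical city list standing in for "words" from the question.
words_list = ["Boston", "Chicago", "Boston", "Denver"]

# set() removes the duplicate "Boston" and enables O(1) average-case
# lookups; "x in words_list" would instead scan the list element by element.
words_set = set(words_list)

print(len(words_set))            # 3 unique cities
print("Chicago" in words_set)    # True
print("Miami" in words_set)      # False
```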
I was also going to ask about that little for item in words loop: it confused me because item is never used, so the whole loop is a big waste of time.
It most likely can be reduced to:
for next_row in reader:
    search_term = next_row["CITY_DECODED"]
    if search_term in words:
        next_row["EXISTS"] = 1
    writer.writerow(next_row)
So, let's sum up all the iterations you had:

~6k to build words (for next_row in reader2: words.append(next_row["City"]))

~14k iterations of for next_row in reader, each of which runs the ~6k-iteration inner loop, and each of those inner iterations re-scans words in the in check — on the order of hundreds of billions of comparisons in total.

Taking out the extraneous loop leaves roughly 14k × 6k ≈ 84 million comparisons, which is... well, much better. Making words a set would cut it further, to ~14k constant-time lookups.
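Putting both fixes together, the whole job is a single pass with a set. The sketch below assumes Python 3 (so files are opened in text mode with newline="" instead of "rb"/"wb"); the file names and column layouts are taken from the question:

```python
import csv

def mark_existing(location_path, cities_path, output_path):
    """Flag rows of the LOCATION_ID file whose CITY_DECODED appears
    in the CITIES file. Column layouts assumed from the question."""
    input_fields = ("ID", "CITY_DECODED", "CITY", "STATE", "COUNTRY",
                    "SPELL1", "SPELL2", "SPELL3")
    city_fields = ("Latitude", "Longitude", "City")

    # Build the lookup structure once, as a set: each later "in" check
    # is then an O(1) hash lookup instead of a linear scan.
    with open(cities_path, newline="") as f:
        reader2 = csv.DictReader(f, fieldnames=city_fields)
        next(reader2)                                # skip the header row
        words = {row["City"] for row in reader2}

    output_fields = ("EXISTS",) + input_fields
    with open(location_path, newline="") as inf, \
         open(output_path, "w", newline="") as outf:
        reader = csv.DictReader(inf, fieldnames=input_fields)
        writer = csv.DictWriter(outf, fieldnames=output_fields)
        writer.writeheader()
        next(reader)                                 # skip the header row
        for row in reader:
            if row["CITY_DECODED"] in words:         # single O(1) lookup
                row["EXISTS"] = 1
            writer.writerow(row)                     # missing EXISTS -> ""
```

With 14k rows against 6k cities this does ~14k constant-time lookups, so it should finish in well under a minute rather than hours.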
Upvotes: 2