zumbamusic

Reputation: 190

Python Improve List Compare Speed

I need help improving this script's speed. It works fine at first, but the longer it runs the slower it gets, and I always have to restart it to get full speed back. I really need to find a way to speed it up.

How the script works:

  1. It opens up saved .txt files: skus_local (~100-400k lines) and keywords_local (~2 million+ lines)
  2. It reads url,category pairs from a ~10k-line file and repeats steps 3, 5, and 6 for each pair
  3. The script scrapes two lists: new_skus (~400 values) and new_keywords (up to ~1k values)
  4. The script checks new_skus against old_skus and builds a new upload_skus list from the unique values
  5. The same is done for new_keywords and old_keywords
  6. The script appends upload_skus and upload_keywords to file

I can see that the comparisons in steps 4, 5 (and maybe 6) are causing the slowdown.
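The slowdown pattern is consistent with list membership tests: each `sku not in old_skus` scans the whole list, so lookups cost O(len(old_skus)), and old_skus only grows while the script runs. A quick sketch (data sizes are rough assumptions based on the numbers in the question) shows the same filtering done against a list and against a set — the results are identical, but set lookups are O(1) on average:

```python
# Hypothetical data approximating the question's sizes: a large, growing
# collection of already-seen SKUs and a small batch of freshly scraped ones.
old_skus_list = [str(i) for i in range(200000)]
old_skus_set = set(old_skus_list)
new_skus = [str(i) for i in range(199500, 200400)]  # 500 duplicates + 400 new

# List version (what the question's loop does): one full scan per lookup.
upload_via_list = [s for s in new_skus if s not in old_skus_list]

# Set version: constant-time average lookups, same result.
upload_via_set = [s for s in new_skus if s not in old_skus_set]

assert upload_via_list == upload_via_set
print(len(upload_via_set))  # 400 genuinely new SKUs
```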

        try:
            f = open(settings['skus_local'],"r")
            old_skus=f.read().split("\n")[:-1]
            f.close()
            del f
        except IOError:
            old_skus=[]
            f = open(settings['skus_local'],"w")
            f.close()
            del f
        skus_local_file = open(settings['skus_local'],"a")

        try:
            f = open(settings['keywords_local'], "r")
            old_keywords=f.read().split("\n")[:-1]
            f.close()
            del f
        except IOError:
            old_keywords=[]
            f = open(settings['keywords_local'], "w")
            f.close()
            del f
        keywords_local_file = open(settings['keywords_local'],"a")


        csv_reader_counter = 0
        for category, url in csv.reader(fp):
            if not (csv_reader_counter == fp_counter):
              csv_reader_counter = csv_reader_counter + 1
              continue

            print url,category

            new_skus, new_keywords = ScraperJP.main(url)

            upload_skus=[]

            for sku in new_skus:
                if sku not in old_skus:
                    upload_skus.append(sku)

            del new_skus

            if upload_skus!=[]:
                insert_products.main(settings['admin_url'],settings['username'],settings['password'],upload_skus,category)
                for sku in upload_skus:
                    skus_local_file.write(sku+"\n")
                    old_skus.append(sku)
                skus_local_file.flush()
                del upload_skus

            upload_keywords=[]

            for urls in new_keywords:
                if urls not in old_keywords:
                    upload_keywords.append(urls)
            del new_keywords

            if upload_keywords!=[]:
                for keyword in upload_keywords:
                    keywords_local_file.write(keyword+"\n")
                    old_keywords.append(keyword)
                keywords_local_file.flush()
            del upload_keywords

            csv_reader_counter = csv_reader_counter + 1
            fp_counter = fp_counter + 1
            fl = open('lineno.txt',"w")
            fl.write(str(fp_counter))
            fl.close()
            gc.collect()

        os.remove('lineno.txt')
        skus_local_file.close()
        keywords_local_file.close()
        fp.close()
        del skus_local_file
        del keywords_local_file
        del fp
if __name__=='__main__':
    main()

Upvotes: 0

Views: 116

Answers (1)

Padraic Cunningham

Reputation: 180391

Store the information in sets.

To check for new content you just need new_skus - old_skus.

So instead of lines like:

for sku in new_skus:
    if sku not in old_skus:
       upload_skus.append(sku)

You can use new_skus.difference(old_skus) which will give elements in new_skus but not in old_skus.
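For example (the SKU values here are made up for illustration):

```python
old_skus = {"A1", "A2", "A3"}        # previously seen SKUs, loaded as a set
new_skus = ["A2", "B7", "B8"]        # freshly scraped this iteration

# Same as set(new_skus) - old_skus: elements in new_skus but not old_skus.
upload_skus = set(new_skus).difference(old_skus)
print(sorted(upload_skus))           # ['B7', 'B8']

old_skus.update(upload_skus)         # remember them for later iterations
```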

If you want to store the set, you can use pickle.

import pickle

s = {1,2,3,4}
with open("s.pick","wb") as f: # pickle it to file
    pickle.dump(s,f)

with open("s.pick","rb") as f1:
    un_p = pickle.loads(f1.read()) # unpickle and use

print un_p

set([1, 2, 3, 4])

You can also append objects to one file:

s2 = {4,5,6,7}

import pickle

with open("s.pick","ab") as f:
    pickle.dump(s2,f)


with open("s.pick","rb") as f1:
    s1 = pickle.load(f1)
    s2 = pickle.load(f1)
    print s1,s2

set([1, 2, 3, 4]) set([4, 5, 6, 7])

Example of using sets:

s1={1, 2, 3, 4}
s2={4, 5, 6, 7}
s3={8,9,10,11}
print s1.difference(s2)
print s1.union(s2,s3)
set([1, 2, 3]) # in set 1 but not in set 2
set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) # all elements in s1,s2 and s3

You can add contents of one set to another using:

s1.update(s2) #  add contents of s2 to s1
print "updated s1 with contents of s2", s1
updated s1 with contents of s2 set([1, 2, 3, 4, 5, 6, 7])
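Putting it together, the question's steps 4-6 could be sketched like this (the function and variable names are placeholders, not the original code; the file handle can be any writable text stream):

```python
import io

def process_batch(new_skus, old_skus, skus_file):
    """Step 4: find unique new values; step 6: append them to the local file."""
    upload_skus = set(new_skus) - old_skus   # set difference replaces the O(n*m) loop
    for sku in upload_skus:
        skus_file.write(sku + "\n")
    skus_file.flush()
    old_skus |= upload_skus                  # in-place update, like old_skus.update(...)
    return upload_skus

# Demo with an in-memory stream standing in for skus_local_file:
old = {"A1"}
buf = io.StringIO()
new = process_batch(["A1", "B7"], old, buf)
print(sorted(new))                           # ['B7']
```

The same function works for the keywords; the key point is that old_skus is built once as a set at startup (e.g. `set(f.read().split("\n")[:-1])`) and only updated in memory afterwards.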

Upvotes: 1
