Reputation: 23
This Python code gets slower the longer it runs.
Can anyone please tell me why?
I hope it is not re-indexing for every line I query and counting from the start again; I thought it would be some kind of file stream.
Going from line 10k to 20k takes 2 seconds; going from 300k to 310k takes about 5 minutes, and it keeps getting worse. Up to that point the code only runs the ELSE part, 'listoflines' is constant (a 'list' with 850000 lines), and 'offset' is just a constant 'int'.
The source files have from several million up to over 20 million lines.
I expected 'dummyline not in listoflines' to take the same time on every call.
    with open(filename, "rt") as source:
        for dummyline in source:
            if (len(dummyline) > 1) and (dummyline not in listoflines):
                # RUN compute
                # this part is not reached where I have the problem
            else:
                if dummyalreadycheckedcounter % 10000 == 0:
                    print("%d/%d: %s already checked or not valid " % (dummyalreadycheckedcounter, offset, dummyline))
                dummyalreadycheckedcounter = dummyalreadycheckedcounter + 1
Upvotes: 0
Views: 50
Reputation: 565
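Actually, the 'in' operation on a list does not take the same time every time. It is O(n): Python scans the list from the start until it finds a match, so with 850000 entries a lookup gets slower the further into the list the match sits, and it is slowest when there is no match at all.
You want to use a set, where a membership test is O(1) on average. See https://wiki.python.org/moin/TimeComplexity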
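To make the difference concrete, here is a minimal timing sketch. The data is made up (200000 generated strings standing in for 'listoflines'), and `time_membership` is a hypothetical helper, not something from the question; absolute timings will vary by machine.

```python
import timeit


def time_membership(container, probe, repeats=20):
    """Return the total seconds taken by `repeats` membership tests."""
    return timeit.timeit(lambda: probe in container, number=repeats)


# Hypothetical stand-in for 'listoflines' (assumption: unique text lines).
lines_list = ["line %d\n" % i for i in range(200_000)]
lines_set = set(lines_list)

early = lines_list[10]   # found after scanning only a few elements
late = lines_list[-1]    # found only after scanning the whole list

list_early = time_membership(lines_list, early)
list_late = time_membership(lines_list, late)
set_late = time_membership(lines_set, late)

print("list, early match: %.6f s" % list_early)
print("list, late match:  %.6f s" % list_late)
print("set,  late match:  %.6f s" % set_late)
```

If the lines in the source file appear in roughly the same order as in the list, each later line is matched deeper in the list, which would produce exactly the kind of progressive slowdown described in the question.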
You didn't ask for this, but I'd suggest turning this into a processing pipeline, so that your compute part is not mixed with the dedup logic:
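    def dedupped_stream(filename):
        seen = set()  # lines already yielded; set lookups are O(1) on average
        with open(filename, "rt") as source:
            for each_line in source:
                # skip empty lines (just a newline) and lines seen before
                if len(each_line) > 1 and each_line not in seen:
                    seen.add(each_line)
                    yield each_line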
Then you can just do
    for line in dedupped_stream(...):
        ...
and you would not need to worry about deduplication there at all.
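Since the question starts from an existing list of already-checked lines, the same idea can be seeded with that list. The following is only a sketch under assumptions: `unchecked_lines` is a hypothetical name, and an in-memory `io.StringIO` stands in for the real source file so the example runs on its own.

```python
import io


def unchecked_lines(source, already_checked):
    """Yield valid lines from `source` that are not in `already_checked`.

    `source` is any iterable of lines (an open file works the same way).
    """
    seen = set(already_checked)  # one-time O(n) conversion of the list
    for line in source:
        # len(line) > 1 skips lines that contain only a newline
        if len(line) > 1 and line not in seen:
            seen.add(line)
            yield line


# Made-up sample data (assumption: lines keep their trailing newline).
listoflines = ["a\n", "b\n"]
source = io.StringIO("a\nc\n\nb\nc\nd\n")

result = list(unchecked_lines(source, listoflines))
print(result)  # prints ['c\n', 'd\n']
```

Converting the list to a set once up front costs a single pass over the 850000 entries; after that, every per-line check is a hash lookup instead of a list scan.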
Upvotes: 1
Reputation: 559
I agree with @Sedy Vlk. Use a hash-based container (that is, a dictionary in Python) instead, so that each lookup no longer scans the whole list:
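    # assumed: clines holds the already-checked lines, nlines the lines read from the source
    clines_count = {l: 0 for l in clines}
    counter = 0
    for line in nlines:
        if len(line) > 1 and line not in clines_count:
            pass  # run the compute step here
        else:
            if counter % 10000 == 0:
                print("%d: %s already checked or not valid " % (counter, line))
            counter += 1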
Upvotes: 1