beiex

Reputation: 23

python file operation slowing down on massive text files

This python code is slowing down the longer it runs.

Can anyone please tell me why?

I hope it is not re-reading the file and counting from the start again for every line I query; I thought it would behave like some kind of file stream?!

From line 10k to 20k it takes 2 seconds; from 300k to 310k it takes about 5 minutes, and it keeps getting worse. Up to that point the code only ever runs the ELSE branch, 'listoflines' is constant (850,000 lines in the list) and of type 'list', and 'offset' is just a constant 'int'.

The source file has millions of lines up to over 20 million lines.

I thought 'dummyline not in listoflines' should take the same amount of time every time.

with open(filename, "rt") as source:
    for dummyline in source:
        if (len(dummyline) > 1) and (dummyline not in listoflines):
            # RUN compute
            # this part is not reached where I have the problem
            pass
        else:
            if dummyalreadycheckedcounter % 10000 == 0:
                print("%d/%d: %s already checked or not valid " % (dummyalreadycheckedcounter, offset, dummyline))
            dummyalreadycheckedcounter = dummyalreadycheckedcounter + 1

Upvotes: 0

Views: 50

Answers (2)

Sedy Vlk

Reputation: 565

Actually, the `in` operation on a list is not constant time; it is O(n), so it gets slower and slower as the list grows.

You want to use a `set` instead. See https://wiki.python.org/moin/TimeComplexity
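To see the difference concretely, here is a small timing sketch (the data is made up for illustration): membership testing on a list scans it front to back, while a set does a hash lookup.

```python
import timeit

# 100k lines of sample data; the "needle" is near the end,
# which is the worst case for a list scan.
items = [str(i) for i in range(100_000)]
as_list = list(items)
as_set = set(items)
needle = items[-1]

list_time = timeit.timeit(lambda: needle in as_list, number=100)
set_time = timeit.timeit(lambda: needle in as_set, number=100)

# The set lookup is dramatically faster, and stays fast as data grows.
print(list_time, set_time)
```

With 850k lines checked against millions of input lines, that O(n) scan is exactly why the loop keeps slowing down.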

You didn't ask for this, but I'd suggest turning this into a processing pipeline, so your compute part isn't mixed in with the dedup logic:

def dedupped_stream(filename):
    seen = set()
    with open(filename, "rt") as source:
        for each_line in source:
            if len(each_line) > 1 and each_line not in seen:
                seen.add(each_line)
                yield each_line

then you can do just

for line in dedupped_stream(...):
    ...

you would not need to worry about deduplication here at all
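For instance, here is the generator above run end to end against a small temporary file (the file contents are made up for the demo):

```python
import os
import tempfile

def dedupped_stream(filename):
    # Yield each line at most once, skipping blank lines.
    seen = set()
    with open(filename, "rt") as source:
        for each_line in source:
            if len(each_line) > 1 and each_line not in seen:
                seen.add(each_line)
                yield each_line

# Write a small sample file with a duplicate and a blank line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\nalpha\n\ngamma\n")
    path = tmp.name

try:
    unique = [line.strip() for line in dedupped_stream(path)]
finally:
    os.remove(path)

print(unique)  # ['alpha', 'beta', 'gamma']
```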

Upvotes: 1

정도유

Reputation: 559

Same opinion as @Sedy Vlk. Use a hash table (that is, a dictionary in Python) instead.

clines_count = {l: 0 for l in clines}
for line in nlines:
    if len(line) > 1 and line in clines_count:
        pass
    else:
        if counter % 10000 == 0:
            print("%d: %s already checked or not valid " % (counter, line))
        counter += 1
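A tiny worked example of the dict-membership idea above; `clines` and `nlines` here are made-up sample data standing in for the already-checked lines and the new input:

```python
# Lines already processed, and the incoming lines to filter.
clines = ["alpha\n", "beta\n"]
nlines = ["alpha\n", "gamma\n", "\n", "beta\n"]

# Dict keys give O(1) average-case membership tests.
clines_count = {l: 0 for l in clines}
new_lines = [line for line in nlines
             if len(line) > 1 and line not in clines_count]

print(new_lines)  # ['gamma\n']
```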

Upvotes: 1
