hyhno01

Reputation: 177

Optimize runtime of pattern matching

Consider two lists, the first consisting of 700 words and the second of 30,000 possible sentence beginnings. That gives 21,000,000 combinations of sentence beginning and word.

Furthermore, there are about 400 files containing results for every possible sentence + word combination. Each file consists of 170,000,000 lines and has the following structure:

this is the first sentence
    1. result for that sentence
    2. result for that sentence
    ...
    nth result for that sentence

this is the second sentence
    ...

this is the nth sentence
    ...

For every possible sentence + word combination I would like to find the results file that carries information about the combination (for every combination there is exactly one results file in which it occurs) and read out the results. You could do it with nested for loops:

import os

all_results = []

# create all sentence + word combinations
for sentence in sentencelist:
    for word in wordlist:
        combo = sentence + ' ' + word

        # loop through the results files until a result for the combination has been found
        c_result = []
        while not c_result:
            for resultsfilename in os.listdir(resultsdirectory):
                with open(os.path.join(resultsdirectory, resultsfilename), 'r') as infile:
                    results = infile.read().splitlines()
                if combo in results:
                    c_result = function_to_find_pattern_based_on_continuation(continuation, results)

        # append the results and reset c_result
        all_results.append(c_result)
        c_result = []

However, this algorithm has quite a long runtime and I'm wondering how it could be improved. For instance, I'm wondering how I could avoid loading the results files over and over again. Furthermore, I would like to work on a copy of the results files so that, after the results of a sentence + word combination have been read out of a file, they can be deleted from the copy (I don't want to change the files on the drive). However, every results file is about 7 GB, so it would not make sense to hold an entire file in a variable, right?
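Loading each file only once and checking every line against a precomputed set of combinations is roughly what I have in mind. A rough sketch, reusing the names from the code above:

import os
from itertools import product

# rough sketch, reusing sentencelist, wordlist and resultsdirectory from above
combos = {sentence + ' ' + word for sentence, word in product(sentencelist, wordlist)}

# map every combination to the single results file that contains it,
# reading each file exactly once and never holding 7 GB in memory
combo_to_file = {}
for resultsfilename in os.listdir(resultsdirectory):
    path = os.path.join(resultsdirectory, resultsfilename)
    with open(path) as infile:
        for line in infile:
            line = line.rstrip('\n')
            if line in combos:
                combo_to_file[line] = path

The results for each combination would then only have to be read from that one file, or could even be collected in the same pass.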

Are there other things that could be done to improve the runtime?

Edit 1: Adapted the size of the lists
Edit 2: Added the while loop and comments in the code

Upvotes: 1

Views: 50

Answers (1)

Object object

Reputation: 2054

You have two problems here, as I understand it:

  1. You need some way to reduce I/O on several large files.
  2. You need a way to modify/copy some of these large files.

There are a few ways I think you can solve these issues. Firstly, if it is possible, I would use a database like sqlite - this will remove the problem of constantly opening and closing lots of files.
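A rough sketch of the sqlite idea - the table layout and the parsing of your results files are assumptions based on the structure shown in the question:

import sqlite3

con = sqlite3.connect('results.db')
con.execute('CREATE TABLE IF NOT EXISTS results (sentence TEXT, result TEXT)')
con.execute('CREATE INDEX IF NOT EXISTS idx_sentence ON results (sentence)')

def import_file(path, batch_size=100_000):
    """One-off import: stream one results file and insert (sentence, result) rows."""
    sentence, rows = None, []
    with open(path) as f:
        for line in f:
            if not line.strip():                # blank separator line
                continue
            if line.startswith((' ', '\t')):    # indented -> a result line
                rows.append((sentence, line.strip()))
            else:                               # a new sentence line
                sentence = line.strip()
            if len(rows) >= batch_size:
                con.executemany('INSERT INTO results VALUES (?, ?)', rows)
                rows = []
        if rows:
            con.executemany('INSERT INTO results VALUES (?, ?)', rows)
    con.commit()

# afterwards every combination is one indexed lookup instead of a full file scan
# (combo is the sentence + ' ' + word string from your loop)
cur = con.execute('SELECT result FROM results WHERE sentence = ?', (combo,))
c_result = [row[0] for row in cur]

The import of 400 files will take a while once, but after that each of the 21,000,000 lookups hits the index instead of re-reading 7 GB files.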

Secondly, you can use Python's yield keyword in your for loop (place it in its own function) and then iterate over it as a generator, processing it like a stream as you go. This allows you to store the results (say, in a file) without putting them all in a list, which would run out of memory pretty fast by the sound of it.
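For example, a generator that yields one (sentence, results) block at a time - again assuming the indentation shown in the question, and with the file names here only as placeholders:

def iter_blocks(path):
    """Yield (sentence, results) pairs from one results file as a stream."""
    sentence, results = None, []
    with open(path) as f:
        for line in f:
            if not line.strip():                # skip blank separator lines
                continue
            if line.startswith((' ', '\t')):    # indented -> a result line
                results.append(line.strip())
            else:                               # a new sentence starts a block
                if sentence is not None:
                    yield sentence, results
                sentence, results = line.strip(), []
    if sentence is not None:                    # flush the last block
        yield sentence, results

# only one block is in memory at a time; write matches out instead of keeping a list
with open('matches.txt', 'w') as out:
    for sentence, results in iter_blocks('some_results_file.txt'):
        if sentence in combos:                  # combos = set of sentence + word strings
            out.write(sentence + '\n' + '\n'.join(results) + '\n')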

Upvotes: 1
