csit

Reputation: 23

How can I speed up matching against a huge keywords file when processing takes too long?

I am using the following code with a dictionary file, Dictionary.txt, and a search text file, SearchText.csv. For each row of the search text I use regex to find the matching keywords, store them, and count them.

The problem is that some of the dictionary files contain thousands or hundreds of thousands of keywords, and processing takes far too long. I ran the code on one dictionary with 300,000 keywords, and after an hour it hadn't written a single row.

So, what should I do to reduce the running time of this process?

import csv
import time
import re

allCities = open('Dictionary.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
with open('SearchText.csv') as descriptions, open('Result---' + timestr + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['Sr_Num', 'Search', 'matched Keywords', 'Total matches']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line = 0
    for eachRow in descriptions_reader:
        matches = 0
        Sr_Num = eachRow['Sr_Num']
        description = eachRow['Text']
        citiesFound = set()
        # Check every dictionary entry against this row's text.
        for eachcity in allCities:
            eachcity = eachcity.strip()
            if re.search(r'\b' + eachcity + r'\b', description, re.IGNORECASE):
                citiesFound.add(eachcity)
                matches += 1
        if len(citiesFound) == 0:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " - ", 'Total matches': matches})
        else:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " , ".join(citiesFound), 'Total matches': matches})
        line += 1
        print(line)  # progress indicator

print(" Process Complete ! ")

Here is an example of some rows from Dictionary.txt:

les Escaldes
Andorra la Vella
Umm al Qaywayn
Ras al Khaimah
Khawr Fakkn
Dubai
Dibba Al Fujairah
Dibba Al Hisn
Sharjah
Ar Ruways

Upvotes: 1

Views: 110

Answers (2)

snakecharmerb

Reputation: 55699

As user Martineau comments, it's best to profile code to determine where optimisation will be most effective, and to measure the effects of any attempted optimisations.

In the absence of any profiling data, the best candidate for optimisation is likely to be this inner loop:

for eachcity in allCities:
    eachcity = eachcity.strip()
    if re.search(r'\b' + eachcity + r'\b', description, re.IGNORECASE):
        citiesFound.add(eachcity)

Here the code calls strip on every string in allCities for each row of the search file (something that could be done just once, outside the loop) and then calls re.search separately for each city.
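For instance, the stripping can be hoisted out so it happens once per run rather than once per row per city:

allCities = [city.strip() for city in allCities]  # strip each name once, up front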

It may be more efficient to combine all the cities into a single regex using the | metacharacter, which denotes alternation. For example, the pattern

r'foo|bar'

will match 'foo' or 'bar'.

Here's a simple example:

>>> text = 'I am flying from les Escaldes to Dubai via Sharjah on Monday'
>>> cities = ['Sharjah', 'Dubai', 'les Escaldes', 'Ar Ruways']
>>> pattern = r'|'.join(r'\b{}\b'.format(re.escape(city)) for city in cities)
>>> pattern
'\\bSharjah\\b|\\bDubai\\b|\\bles\\ Escaldes\\b|\\bAr\\ Ruways\\b'
>>> matches = re.findall(pattern, text)
>>> print(matches)
['les Escaldes', 'Dubai', 'Sharjah']

Calling re.escape on each city name prevents the search pattern from being altered if a city name contains a character which is also a regular expression metacharacter.
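To illustrate with a hypothetical name containing a dot (not from the question's dictionary): without escaping, the dot behaves as a wildcard and matches text it shouldn't.

>>> bool(re.search(r'\bSt. Kitts\b', 'Stx Kitts'))  # unescaped '.' matches any character
True
>>> bool(re.search(r'\b{}\b'.format(re.escape('St. Kitts')), 'Stx Kitts'))
False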

Applying this technique to the code in the question:

# We only need to create the regex pattern once,
# so do it outside the loop.
pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for city in allCities)

for eachRow in descriptions_reader:
    Sr_Num = eachRow['Sr_Num']
    description = eachRow['Text']
    # One findall call per row replaces the loop over 300,000 cities.
    found = re.findall(pattern, description, re.IGNORECASE)
    citiesFound = set(found)
    matches = len(found)

The behaviour of the | metacharacter is described in the documentation.

REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.

So if there are potential matches which are substrings of other candidates - like "Dubai" and "Dubai City" - the longer candidate must appear earlier in the pattern, otherwise the engine will find the shorter and return it as the match.
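A quick demonstration with those two names:

>>> re.findall(r'\bDubai\b|\bDubai City\b', 'flights to Dubai City')
['Dubai']
>>> re.findall(r'\bDubai City\b|\bDubai\b', 'flights to Dubai City')
['Dubai City']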

To prevent this, sort allCities in descending order of length before creating the pattern:

allCities.sort(key=len, reverse=True)

or use sorted if the order of allCities must be preserved:

pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for
                    city in sorted(allCities, key=len, reverse=True))
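Putting it all together, here is a sketch of how the whole script might look with a single combined pattern, keeping the question's file names and column layout; the explicit re.compile is an addition not in the snippets above, so the 300,000-branch pattern is compiled exactly once:

import csv
import re
import time

allCities = open('Dictionary.txt', encoding="utf8").readlines()
# Longest names first, so longer alternatives win over their substrings.
pattern = re.compile(
    r'|'.join(r'\b{}\b'.format(re.escape(city.strip()))
              for city in sorted(allCities, key=len, reverse=True)),
    re.IGNORECASE)

timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
with open('SearchText.csv') as descriptions, open('Result---' + timestr + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['Sr_Num', 'Search', 'matched Keywords', 'Total matches']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    for eachRow in descriptions_reader:
        description = eachRow['Text']
        found = pattern.findall(description)  # one scan of the row
        citiesFound = set(found)
        output_writer.writerow({
            'Sr_Num': eachRow['Sr_Num'],
            'Search': description,
            'matched Keywords': " , ".join(citiesFound) if citiesFound else " - ",
            'Total matches': len(found)})

One behavioural difference to be aware of: len(found) counts repeated mentions of the same city, whereas the original code counted each dictionary entry at most once per row; use len(citiesFound) if the original behaviour is wanted.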

Upvotes: 1

Alain T.

Reputation: 42143

You could use a set instead of a list of city names, and split the description on spaces to isolate words. This may run faster than a regular expression.

For example:

...
allCities = open('Dictionary.txt', encoding="utf8").readlines()
citySet = {city.lower().strip() for city in allCities}
...
    ...
    citiesFound = {word for word in description.split(" ") if word.lower() in citySet}
    ...
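For completeness, a minimal sketch of how this slots into the question's loop, assuming the question's file and column names. One caveat: because the description is split on spaces, multi-word names such as "les Escaldes" or "Dibba Al Fujairah" can never match, since each word is looked up on its own.

import csv

allCities = open('Dictionary.txt', encoding="utf8").readlines()
citySet = {city.lower().strip() for city in allCities}  # O(1) membership tests

with open('SearchText.csv') as descriptions:
    for eachRow in csv.DictReader(descriptions):
        description = eachRow['Text']
        # Test each word of the row against the whole dictionary at once.
        citiesFound = {word for word in description.split(" ")
                       if word.lower() in citySet}
        matches = len(citiesFound)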

Upvotes: 0
