Python regex problems with string matching

Question

Sorry in advance for such a long post

EDIT--

Modified from Norman's Solution to print and return if we find an exact solution, otherwise print all approximate matches. It's currently still only getting 83/85 matches for a specific example of searching for etnse on the dictionary file provided below on the third pastebin link.

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = []

    begin = time.time()

    # get all of the patterns associated with the given phrase
    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        for m in re.finditer(pattern, entireFile):
            startIndices.append((m.start(), m.end(), m.group()))
        # if the first pattern(exact match) is valid, then just print the results and we're done
        if len(startIndices) != 0 and startIndices[0][2] == origPattern:
            print("
There is an exact match at: [{}:{}] for {}").format(*startIndices[0])
            return

    print('Used {} patterns:').format(len(patterns))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}').format(i, p)

    # list for all non-overlapping starting indices
    nonOverlapping = []
    # hold the last matches ending position
    lastEnd = 0
    # find non-overlapping matches by comparing each matches starting index to the previous matches ending index
    # if the starting index > previous items ending index they aren't overlapping
    for start in sorted(startIndices):
        print(start)
        if start[0] >= lastEnd:
            # startIndicex[start][0] gets the ending index from the current matches tuple
            lastEnd = start[1]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:').format(len(startIndices))
    # i is the key  assigned to the value of the indices (, 
    for start in sorted(startIndices):
        # *startIndices[i] means to unpack the tuple associated to the key i's value to be used by format as 2 inputs
        # for explanation, see: http://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean-in-python
        print('- [{}:{}]  {}').format(*start)

    print()
    print('Found {} non-overlapping matches:').format(len(nonOverlapping))
    for ov in nonOverlapping:
        print('- [{}:{}]  {}').format(*ov)

    end = time.time()
    print(end-begin)

def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes. (replace)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # replace with a wildcard for each index
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions. (omitted)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # remove a char for each index
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        # insert a wildcard between adjacent chars for each index
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

ORIGINAL: http://pastebin.com/bAXeYZcD - the actual function

http://pastebin.com/YSfD00Ju - data to use, should be 8 matches for 'ware' but only gets 6

http://pastebin.com/S9u50ig0 - data to use, should get 85 matches for 'etnse' but only gets 77

I left all of the original code in the function because I'm not sure exactly what is causing the problem.

you can search for 'Board:isFull()' on anything to get the error stated below.

examples:

assume you named the second pastebin 'someFile.txt' in a folder named files in the same directory as the .py file.

file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")

OR

file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")

OR

assume you named the third pastebin 'dictionary.txt' in a folder named files in the same directory as the .py file.

file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")

--EDIT

The functions parameters work like so:

file is the location of a file.

origPattern is a phrase.

The function is basically supposed to be a fuzzy search. It's supposed to take the pattern and search through a file to find matches that are either exact, or with a 1 character deviation. i.e.: 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.

For the most part it works, But i'm running into a few problems.

First, when I try to use something like 'Board:isFull()' for origPattern I get the following:

    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

the above is from the re library

I've tried using re.escape() but it doesn't change anything.

Second, when I try some other things like 'Fun()' it says it has a match at some index that doesn't even contain any of that; it's just a line of '*'

Third, When it does find matches it doesn't always find all of the matches. For example, there's one file I have that should find 85 matches, but it only comes up with like 77, and another with 8 but it only comes up with 6. However, they are just alphabetical so it's likely only a problem with how I do searching or something.

Any help is appreciated.

I also can't use fuzzyfinder

Norman · Accepted Answer

I found some issues in the code:

re.escape() seems to not work because its result is not assigned.
Do origPattern = re.escape(origPattern).
When pattern is correctly escaped, be mindful of not breaking the escaping when manipulating the pattern.
Example: re.escape('Fun()') yields the string Fun. The two $ substrings in it must never be separated: never remove, replace, or swap a \ without the char it escapes.
Bad manipulations: Fun($ (removal), Fu\n(\) (swap), Fun\.{0,2}\).
Good manipulations: Fun\) (removal), Fu$n$ (swap), Fun.{0,2}\).
You find too few matches because you only try to find fuzzy matches if there are no exact matches. (See line if indices.__len__() != 0:.) You must always look for them.
The loops inserting '.{0,2}' produce one too many pattern, e.g. 'ware.{0,2}' for ware. Unless you intend that, this pattern will find wareXY which has two insertions.
The patterns with .{0,2} don't work as described; they allow one change and one insertion.
I'm not sure about the code involving difflib.Differ. I don't understand it, but I suspect there should be no break statements.
Even though you use a set to store indices, matches from different regexes may still overlap.
You don't use word boundaries (\b) in your regexes, though for natural language that would make sense.
Not a bug, but: Why do you call magic methods explicitly?
(E.g. indices.__len__() != 0 instead of len(indices) != 0.)

I rewrote your code a bit to address any issues I saw:

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = {}

    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        startIndices.update((m.start(), (m.end(), m.group())) for m in re.finditer(pattern, entireFile))

    print('Used {} patterns:'.format(len(patterns)))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}'.format(i, p))

    nonOverlapping = []
    lastEnd = 0
    for start in sorted(startIndices):
        if start >= lastEnd:
            lastEnd = startIndices[start][0]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:'.format(len(startIndices)))
    for i in sorted(startIndices):
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))

    print()
    print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
    for i in nonOverlapping:
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))


def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

Python regex problems with string matching

Answers (1)

Related Questions