Darrel Holt
Darrel Holt

Reputation: 910

Python regex problems with string matching

Sorry in advance for such a long post

EDIT--

Modified from Norman's Solution to print and return if we find an exact solution, otherwise print all approximate matches. It's currently still only getting 83/85 matches for a specific example of searching for etnse on the dictionary file provided below on the third pastebin link.

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = []

    begin = time.time()

    # get all of the patterns associated with the given phrase
    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        for m in re.finditer(pattern, entireFile):
            startIndices.append((m.start(), m.end(), m.group()))
        # if the first pattern(exact match) is valid, then just print the results and we're done
        if len(startIndices) != 0 and startIndices[0][2] == origPattern:
            print("\nThere is an exact match at: [{}:{}] for {}").format(*startIndices[0])
            return

    print('Used {} patterns:').format(len(patterns))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}').format(i, p)

    # list for all non-overlapping starting indices
    nonOverlapping = []
    # hold the last matches ending position
    lastEnd = 0
    # find non-overlapping matches by comparing each matches starting index to the previous matches ending index
    # if the starting index > previous items ending index they aren't overlapping
    for start in sorted(startIndices):
        print(start)
        if start[0] >= lastEnd:
            # startIndicex[start][0] gets the ending index from the current matches tuple
            lastEnd = start[1]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:').format(len(startIndices))
    # i is the key <starting index> assigned to the value of the indices (<ending index>, <string at those indices>
    for start in sorted(startIndices):
        # *startIndices[i] means to unpack the tuple associated to the key i's value to be used by format as 2 inputs
        # for explanation, see: http://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean-in-python
        print('- [{}:{}]  {}').format(*start)

    print()
    print('Found {} non-overlapping matches:').format(len(nonOverlapping))
    for ov in nonOverlapping:
        print('- [{}:{}]  {}').format(*ov)

    end = time.time()
    print(end-begin)

def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes. (replace)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # replace with a wildcard for each index
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions. (omitted)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # remove a char for each index
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        # insert a wildcard between adjacent chars for each index
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

ORIGINAL: http://pastebin.com/bAXeYZcD - the actual function

http://pastebin.com/YSfD00Ju - data to use, should be 8 matches for 'ware' but only gets 6

http://pastebin.com/S9u50ig0 - data to use, should get 85 matches for 'etnse' but only gets 77

I left all of the original code in the function because I'm not sure exactly what is causing the problem.

you can search for 'Board:isFull()' on anything to get the error stated below.

examples:

assume you named the second pastebin 'someFile.txt' in a folder named files in the same directory as the .py file.

file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")

OR

file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")

OR

assume you named the third pastebin 'dictionary.txt' in a folder named files in the same directory as the .py file.

file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")

--EDIT

The functions parameters work like so:

file is the location of a file.

origPattern is a phrase.

The function is basically supposed to be a fuzzy search. It's supposed to take the pattern and search through a file to find matches that are either exact, or with a 1 character deviation. i.e.: 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.

For the most part it works, But i'm running into a few problems.

First, when I try to use something like 'Board:isFull()' for origPattern I get the following:

    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

the above is from the re library

I've tried using re.escape() but it doesn't change anything.

Second, when I try some other things like 'Fun()' it says it has a match at some index that doesn't even contain any of that; it's just a line of '*'

Third, When it does find matches it doesn't always find all of the matches. For example, there's one file I have that should find 85 matches, but it only comes up with like 77, and another with 8 but it only comes up with 6. However, they are just alphabetical so it's likely only a problem with how I do searching or something.

Any help is appreciated.

I also can't use fuzzyfinder

Upvotes: 1

Views: 626

Answers (1)

Norman
Norman

Reputation: 1975

I found some issues in the code:

  1. re.escape() seems to not work because its result is not assigned.
    Do origPattern = re.escape(origPattern).
  2. When pattern is correctly escaped, be mindful of not breaking the escaping when manipulating the pattern.
    Example: re.escape('Fun()') yields the string Fun\(\). The two \( substrings in it must never be separated: never remove, replace, or swap a \ without the char it escapes.
    Bad manipulations: Fun(\) (removal), Fu\n(\) (swap), Fun\.{0,2}\).
    Good manipulations: Fun\) (removal), Fu\(n\) (swap), Fun.{0,2}\).
  3. You find too few matches because you only try to find fuzzy matches if there are no exact matches. (See line if indices.__len__() != 0:.) You must always look for them.
  4. The loops inserting '.{0,2}' produce one too many pattern, e.g. 'ware.{0,2}' for ware. Unless you intend that, this pattern will find wareXY which has two insertions.
  5. The patterns with .{0,2} don't work as described; they allow one change and one insertion.
  6. I'm not sure about the code involving difflib.Differ. I don't understand it, but I suspect there should be no break statements.
  7. Even though you use a set to store indices, matches from different regexes may still overlap.
  8. You don't use word boundaries (\b) in your regexes, though for natural language that would make sense.
  9. Not a bug, but: Why do you call magic methods explicitly?
    (E.g. indices.__len__() != 0 instead of len(indices) != 0.)

I rewrote your code a bit to address any issues I saw:

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = {}

    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        startIndices.update((m.start(), (m.end(), m.group())) for m in re.finditer(pattern, entireFile))

    print('Used {} patterns:'.format(len(patterns)))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}'.format(i, p))

    nonOverlapping = []
    lastEnd = 0
    for start in sorted(startIndices):
        if start >= lastEnd:
            lastEnd = startIndices[start][0]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:'.format(len(startIndices)))
    for i in sorted(startIndices):
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))

    print()
    print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
    for i in nonOverlapping:
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))


def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

Upvotes: 1

Related Questions