Reputation: 910
Sorry in advance for such a long post
EDIT--
Modified from Norman's Solution to print and return if we find an exact solution, otherwise print all approximate matches. It's currently still only getting 83/85 matches for a specific example of searching for etnse on the dictionary file provided below on the third pastebin link.
def doMatching(file, origPattern):
entireFile = file.read()
patterns = []
startIndices = []
begin = time.time()
# get all of the patterns associated with the given phrase
for pattern in generateFuzzyPatterns(origPattern):
patterns.append(pattern)
for m in re.finditer(pattern, entireFile):
startIndices.append((m.start(), m.end(), m.group()))
# if the first pattern(exact match) is valid, then just print the results and we're done
if len(startIndices) != 0 and startIndices[0][2] == origPattern:
print("\nThere is an exact match at: [{}:{}] for {}").format(*startIndices[0])
return
print('Used {} patterns:').format(len(patterns))
for i, p in enumerate(patterns, 1):
print('- [{}] {}').format(i, p)
# list for all non-overlapping starting indices
nonOverlapping = []
# hold the last matches ending position
lastEnd = 0
# find non-overlapping matches by comparing each matches starting index to the previous matches ending index
# if the starting index > previous items ending index they aren't overlapping
for start in sorted(startIndices):
print(start)
if start[0] >= lastEnd:
# startIndicex[start][0] gets the ending index from the current matches tuple
lastEnd = start[1]
nonOverlapping.append(start)
print()
print('Found {} matches:').format(len(startIndices))
# i is the key <starting index> assigned to the value of the indices (<ending index>, <string at those indices>
for start in sorted(startIndices):
# *startIndices[i] means to unpack the tuple associated to the key i's value to be used by format as 2 inputs
# for explanation, see: http://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean-in-python
print('- [{}:{}] {}').format(*start)
print()
print('Found {} non-overlapping matches:').format(len(nonOverlapping))
for ov in nonOverlapping:
print('- [{}:{}] {}').format(*ov)
end = time.time()
print(end-begin)
def generateFuzzyPatterns(origPattern):
# Escape individual symbols.
origPattern = [re.escape(c) for c in origPattern]
# Find exact matches.
pattern = ''.join(origPattern)
yield pattern
# Find matches with changes. (replace)
for i in range(len(origPattern)):
t = origPattern[:]
# replace with a wildcard for each index
t[i] = '.'
pattern = ''.join(t)
yield pattern
# Find matches with deletions. (omitted)
for i in range(len(origPattern)):
t = origPattern[:]
# remove a char for each index
t[i] = ''
pattern = ''.join(t)
yield pattern
# Find matches with insertions.
for i in range(len(origPattern) + 1):
t = origPattern[:]
# insert a wildcard between adjacent chars for each index
t.insert(i, '.')
pattern = ''.join(t)
yield pattern
# Find two adjacent characters being swapped.
for i in range(len(origPattern) - 1):
t = origPattern[:]
if t[i] != t[i + 1]:
t[i], t[i + 1] = t[i + 1], t[i]
pattern = ''.join(t)
yield pattern
ORIGINAL: http://pastebin.com/bAXeYZcD - the actual function
http://pastebin.com/YSfD00Ju - data to use, should be 8 matches for 'ware' but only gets 6
http://pastebin.com/S9u50ig0 - data to use, should get 85 matches for 'etnse' but only gets 77
I left all of the original code in the function because I'm not sure exactly what is causing the problem.
you can search for 'Board:isFull()' on anything to get the error stated below.
examples:
assume you named the second pastebin 'someFile.txt' in a folder named files in the same directory as the .py file.
file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")
OR
file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")
OR
assume you named the third pastebin 'dictionary.txt' in a folder named files in the same directory as the .py file.
file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")
--EDIT
The functions parameters work like so:
file is the location of a file.
origPattern is a phrase.
The function is basically supposed to be a fuzzy search. It's supposed to take the pattern and search through a file to find matches that are either exact, or with a 1 character deviation. i.e.: 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.
For the most part it works, But i'm running into a few problems.
First, when I try to use something like 'Board:isFull()' for origPattern I get the following:
raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
the above is from the re library
I've tried using re.escape() but it doesn't change anything.
Second, when I try some other things like 'Fun()' it says it has a match at some index that doesn't even contain any of that; it's just a line of '*'
Third, When it does find matches it doesn't always find all of the matches. For example, there's one file I have that should find 85 matches, but it only comes up with like 77, and another with 8 but it only comes up with 6. However, they are just alphabetical so it's likely only a problem with how I do searching or something.
Any help is appreciated.
I also can't use fuzzyfinder
Upvotes: 1
Views: 626
Reputation: 1975
I found some issues in the code:
re.escape()
seems to not work because its result is not assigned.origPattern = re.escape(origPattern)
. re.escape('Fun()')
yields the string Fun\(\)
. The two \(
substrings in it must never be separated: never remove, replace, or swap a \
without the char it escapes.Fun(\)
(removal), Fu\n(\)
(swap), Fun\.{0,2}\)
.Fun\)
(removal), Fu\(n\)
(swap), Fun.{0,2}\)
.if indices.__len__() != 0:
.) You must always look for them.'.{0,2}'
produce one too many pattern, e.g. 'ware.{0,2}'
for ware
. Unless you intend that, this pattern will find wareXY
which has two insertions..{0,2}
don't work as described; they allow one change and one insertion.difflib.Differ
. I don't understand it, but I suspect there should be no break
statements.set
to store indices, matches from different regexes may still overlap.\b
) in your regexes, though for natural language that would make sense.indices.__len__() != 0
instead of len(indices) != 0
.)I rewrote your code a bit to address any issues I saw:
def doMatching(file, origPattern):
entireFile = file.read()
patterns = []
startIndices = {}
for pattern in generateFuzzyPatterns(origPattern):
patterns.append(pattern)
startIndices.update((m.start(), (m.end(), m.group())) for m in re.finditer(pattern, entireFile))
print('Used {} patterns:'.format(len(patterns)))
for i, p in enumerate(patterns, 1):
print('- [{}] {}'.format(i, p))
nonOverlapping = []
lastEnd = 0
for start in sorted(startIndices):
if start >= lastEnd:
lastEnd = startIndices[start][0]
nonOverlapping.append(start)
print()
print('Found {} matches:'.format(len(startIndices)))
for i in sorted(startIndices):
print('- [{}:{}] {}'.format(i, *startIndices[i]))
print()
print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
for i in nonOverlapping:
print('- [{}:{}] {}'.format(i, *startIndices[i]))
def generateFuzzyPatterns(origPattern):
# Escape individual symbols.
origPattern = [re.escape(c) for c in origPattern]
# Find exact matches.
pattern = ''.join(origPattern)
yield pattern
# Find matches with changes.
for i in range(len(origPattern)):
t = origPattern[:]
t[i] = '.'
pattern = ''.join(t)
yield pattern
# Find matches with deletions.
for i in range(len(origPattern)):
t = origPattern[:]
t[i] = ''
pattern = ''.join(t)
yield pattern
# Find matches with insertions.
for i in range(len(origPattern) + 1):
t = origPattern[:]
t.insert(i, '.')
pattern = ''.join(t)
yield pattern
# Find two adjacent characters being swapped.
for i in range(len(origPattern) - 1):
t = origPattern[:]
if t[i] != t[i + 1]:
t[i], t[i + 1] = t[i + 1], t[i]
pattern = ''.join(t)
yield pattern
Upvotes: 1