Reputation: 14245
I'm trying to write a function which will search str for a substr, taking into account different possibilities to write weird letters, such as æ, ø, å in danish language. For example you can search 'Ålborg' and the function will return true if there is, say 'Aalborg' in the str.
The function below works, but performance is unbearable. What would you recommend to improve performance?
def danish_tolerating_search(substr, str):
'''Figure out if substr is in str, taking into account
possible deviations in writing letters æ, ø, å.
æ <-> ae a ea
ø <-> oe o
å <-> aa a o
'''
# normalize input
substr = substr.lower().replace('aa',u'å')
str = str.lower()
# normalized recursive search
# TODO fix perfomance
def s(substr, str):
if str.find(substr) >= 0: return True
if substr.find(u'æ') >= 0:
if s(substr.replace(u'æ','ae', 1), str): return True
elif s(substr.replace(u'æ', 'a', 1), str): return True
elif s(substr.replace(u'æ','ea', 1), str): return True
if str.find(u'æ') >= 0:
if s(substr, str.replace(u'æ','ae', 1)): return True
elif s(substr, str.replace(u'æ', 'a', 1)): return True
elif s(substr, str.replace(u'æ','ea', 1)): return True
if substr.find(u'ø') >= 0:
if s(substr.replace(u'ø','oe', 1), str): return True
elif s(substr.replace(u'ø', 'o', 1), str): return True
if str.find(u'ø') >= 0:
if s(substr, str.replace(u'ø','oe', 1)): return True
elif s(substr, str.replace(u'ø', 'o', 1)): return True
if substr.find(u'å') >= 0:
if s(substr.replace(u'å','aa', 1), str): return True
elif s(substr.replace(u'å', 'a', 1), str): return True
elif s(substr.replace(u'å', 'o', 1), str): return True
if str.find(u'å') >= 0:
if s(substr, str.replace(u'å','aa', 1)): return True
elif s(substr, str.replace(u'å', 'a', 1)): return True
elif s(substr, str.replace(u'å', 'o', 1)): return True
return False
return s(substr, str)
Upvotes: 4
Views: 488
Reputation: 523214
Try
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
def danish_tolerating_search(search, content):
search = search.lower()
content = content.lower()
variants = {
u'a': u'[aæå]',
u'o': u'[oøå]',
u'ae': u'(?:ae|æ)',
u'ea': u'(?:ea|æ)',
u'aa': u'(?:aa|å)',
u'oe': u'(?:oe|ø)',
u'\\å': u'(?:[oå]|aa?)',
u'\\ø': u'(?:ø|oe?)',
u'\\æ': u'(?:æ|ae?|ea)',
}
search = re.escape(search)
search = re.sub(ur'[ae]a|[ao]e?|\\[åøæ]', lambda m: variants[m.group(0)], search)
return re.search(search, content) is not None
I did not test its performance because OP did not release any. I'll just assume the regex engine is better optimized than calling OP's s()
recursively and doing lots of .find
and .replace
.
Here, the key letters in the search string are replaced by the possible equivalent classes in terms of regex, e.g. Ålborg
becomes (?:[oå]|aa?)lb[oøå]rg
. This regex should include all possible variants equivalent to Ålborg
(ålbørg" or "ålbårg" or "aalborg" or "aalbørg" or "aalbårg" or "alborg" or "albørg" or "albårg" as mentioned by @101100's comment). Then the regex is simply searched against the context.
Upvotes: 3
Reputation: 8506
I think you should eliminate the recursion altogether. Instead of doing all that find
and replace
, you could, for example, decide upon a "normal form" of your input strings, convert them accordingly (i.e. replace those "ambiguous" characters) and do a simple
return substring in string_
Note also that you don't need to call find
and replace
together, the latter sufficies. If the search string isn't found, replace just won't replace anything.
Upvotes: 3
Reputation: 14900
This is a classic example of a parser. Read up on things like lex and yacc, you won't need all of their functionality, but the principles still apply.
After that, use the python re module to match against the appropriate regular expressions. If you need more functionality, use the pyparsing libraries.
def danish_tolerating_search(substr, str):
'''Figure out if substr is in str, taking into account
possible deviations in writing letters æ, ø, å.
æ <-> ae a ea
ø <-> oe o
å <-> aa a o
for all of these combinations replace with appropriate regex as in example
'''
substring = substring.lower().replace('aa', '[ae]{1,2}')
string = string.lower()
re.search(substring, string)
Upvotes: 1