Vladimir Keleshev
Vladimir Keleshev

Reputation: 14245

How to improve performance of this recursive function?

I'm trying to write a function which will search str for a substr, taking into account different possibilities to write weird letters, such as æ, ø, å in danish language. For example you can search 'Ålborg' and the function will return true if there is, say 'Aalborg' in the str.

The function below works, but performance is unbearable. What would you recommend to improve performance?

def danish_tolerating_search(substr, str):
    '''Figure out if substr is in str, taking into account
    possible deviations in writing letters æ, ø, å.
    æ  <->  ae a ea
    ø  <->  oe o
    å  <->  aa a o
    '''

    # normalize input
    substr = substr.lower().replace('aa',u'å')
    str = str.lower()

    # normalized recursive search
    # TODO fix perfomance
    def s(substr, str):
        if str.find(substr) >= 0: return True
        if substr.find(u'æ') >= 0:
            if    s(substr.replace(u'æ','ae', 1), str): return True
            elif  s(substr.replace(u'æ', 'a', 1), str): return True
            elif  s(substr.replace(u'æ','ea', 1), str): return True
        if str.find(u'æ') >= 0:
            if    s(substr, str.replace(u'æ','ae', 1)): return True
            elif  s(substr, str.replace(u'æ', 'a', 1)): return True
            elif  s(substr, str.replace(u'æ','ea', 1)): return True
        if substr.find(u'ø') >= 0:
            if    s(substr.replace(u'ø','oe', 1), str): return True
            elif  s(substr.replace(u'ø', 'o', 1), str): return True
        if str.find(u'ø') >= 0:
            if    s(substr, str.replace(u'ø','oe', 1)): return True
            elif  s(substr, str.replace(u'ø', 'o', 1)): return True
        if substr.find(u'å') >= 0:
            if    s(substr.replace(u'å','aa', 1), str): return True
            elif  s(substr.replace(u'å', 'a', 1), str): return True
            elif  s(substr.replace(u'å', 'o', 1), str): return True
        if str.find(u'å') >= 0:
            if    s(substr, str.replace(u'å','aa', 1)): return True
            elif  s(substr, str.replace(u'å', 'a', 1)): return True
            elif  s(substr, str.replace(u'å', 'o', 1)): return True
        return False

    return s(substr, str)

Upvotes: 4

Views: 488

Answers (3)

kennytm
kennytm

Reputation: 523214

Try

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def danish_tolerating_search(search, content):
    search = search.lower()
    content = content.lower()

    variants = {
        u'a': u'[aæå]',
        u'o': u'[oøå]',
        u'ae': u'(?:ae|æ)',
        u'ea': u'(?:ea|æ)',
        u'aa': u'(?:aa|å)',
        u'oe': u'(?:oe|ø)',
        u'\\å': u'(?:[oå]|aa?)',
        u'\\ø': u'(?:ø|oe?)',
        u'\\æ': u'(?:æ|ae?|ea)',
    }

    search = re.escape(search)
    search = re.sub(ur'[ae]a|[ao]e?|\\[åøæ]', lambda m: variants[m.group(0)], search)
    return re.search(search, content) is not None

I did not test its performance because OP did not release any. I'll just assume the regex engine is better optimized than calling OP's s() recursively and doing lots of .find and .replace.

Here, the key letters in the search string are replaced by the possible equivalent classes in terms of regex, e.g. Ålborg becomes (?:[oå]|aa?)lb[oøå]rg. This regex should include all possible variants equivalent to Ålborg (ålbørg" or "ålbårg" or "aalborg" or "aalbørg" or "aalbårg" or "alborg" or "albørg" or "albårg" as mentioned by @101100's comment). Then the regex is simply searched against the context.

Upvotes: 3

jena
jena

Reputation: 8506

I think you should eliminate the recursion altogether. Instead of doing all that find and replace, you could, for example, decide upon a "normal form" of your input strings, convert them accordingly (i.e. replace those "ambiguous" characters) and do a simple

return substring in string_

Note also that you don't need to call find and replace together, the latter sufficies. If the search string isn't found, replace just won't replace anything.

Upvotes: 3

Spencer Rathbun
Spencer Rathbun

Reputation: 14900

This is a classic example of a parser. Read up on things like lex and yacc, you won't need all of their functionality, but the principles still apply.

After that, use the python re module to match against the appropriate regular expressions. If you need more functionality, use the pyparsing libraries.

def danish_tolerating_search(substr, str):
'''Figure out if substr is in str, taking into account
possible deviations in writing letters æ, ø, å.
æ  <->  ae a ea
ø  <->  oe o
å  <->  aa a o
for all of these combinations replace with appropriate regex as in example
'''
substring = substring.lower().replace('aa', '[ae]{1,2}')
string = string.lower()
re.search(substring, string)

Upvotes: 1

Related Questions