Reputation: 5117

Replace multiple (special) characters - most efficient way?

At the texts that I have, I want to replace the following special characters with a single space:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]

What is the most efficient way (in terms of time of code execution) to do this?

For example, I want this:

(Hello World)] *!

to become this:

Hello World

The candidate methods seem to be the following:

list comprehension
.replace()
.translate()
regular expressions

Upvotes: 5

Answers (6)

okainov

Reputation: 4654

While Roght's answer is the best IMO and it shows the objective approach, I'd like to notice that translate is not always the best! You really need to check it out yourself, the result will depend on your inputs.

A bit of complexity theory

(disclaimer: I haven't looked into Python source code, so below is what I would expect), given we have K symbols to replace and N symbols in source string:

str.replace should basically iterate over the whole string checking every symbol and replace it if it's matching the parameter. Looks like pure O(N) , thus for K replacements it will be O(K*N).

On the other hand, translate should iterate over the whole string just once checking every symbol for a match in translation table. As translation table is a hashmap, lookup there is O(1), thus the whole translation doesn't depend on K at all, should be O(N)

Question - why is then replace faster in my case??? I don't know :(

I came around this when I was refactoring our script analyzing test logs (quite big files, think 60Mb+) and it was cleaning it up from some random symbols as well as doing some HTML-sanitization, here is the replacement dictionary:

replace_dict = {
        "&": "&amp;",
        "\"": "&quot;",
        "<": "&lt;",
        ">": "&gt;",
        "\u0000": "",
        "\u0007": "",
        "\u0008": "",
        "\u001a": "",
        "\u001b": "",
    }

When I saw initial code just having 9 str.replace calls in the row this was my first thought - "wtf, let's use translate instead", this must be much faster. However in my case I found out that replace is actually the fastest method.

Test script:

replace_dict = {
    "&": "&amp;",
    "\"": "&quot;",
    "<": "&lt;",
    ">": "&gt;",
    "\u0000": "",
    "\u0007": "",
    "\u0008": "",
    "\u001a": "",
    "\u001b": "",
}

symbols = list(replace_dict.keys())
translate_table = {ord(k): v if v else None for k, v in replace_dict.items()}
with open("myhuge.log") as f:
    big_document = f.read()


def func_replace(doc):
    for k, v in replace_dict.items():
        doc = doc.replace(k, v)
    return doc


def func_trans(doc):
    return doc.translate(translate_table)


def func_list_comp(doc):
    # That's not really equivalent to two methods above, but still good for perf comparison
    return "".join(c for c in doc if c not in symbols)


if __name__ == '__main__':
    import timeit
    number = 5
    print("func_replace(big_document): ", timeit.timeit("func_replace(big_document)",
          setup="from __main__ import func_replace, big_document", number=number))

    print("func_trans(big_document): ", timeit.timeit("func_trans(big_document)",
          setup="from __main__ import func_trans, big_document", number=number))

    print("func_list_comp(big_document): ", timeit.timeit("func_list_comp(big_document)",
          setup="from __main__ import func_list_comp, big_document", number=number))

So here are the results:

func_replace(big_document): 4.945449151098728

func_trans(big_document): 15.22288554534316

func_list_comp(big_document): 45.01621600985527

I can make two conclusions out of it:

List comprehension is really slow, don't use it.
Counter-intuitively, replace can be few times faster than translate for some cases. If your replacement table is not too big and the strings you're working on are too big, seems like replace would be better.

Upvotes: 0

Alain T.

Reputation: 42143

str.translate() is indeed the fastest method. Here's a concise way to build the translation table for exclusion of characters:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]
removeSymbols = str.maketrans("","","".join(symbols))

cleanText = "[Hello World] *!".translate(removeSymbols)
print(cleanText) # "Hello World "

The maketrans() functions can take 3 parameters, the first one is a string with the characters to replace, the second one is their replacements and the third one is a list of characters that should be removed. To bluntly remove all characters, we just need to supply the 3rd parameter with a string containing the symbols to remove.

The translation table removeSymbols then performs a complete removal of the characters in the symbols list.

To replace with spaces, build the translation table like this:

removeSymbols = str.maketrans("".join(symbols)," "*len(symbols))

Upvotes: 0

yatu

Reputation: 88236

For an efficient solution you could use str.maketrans for this. Note that once the translation table is defined, it's onle a matter of mapping the characters in the string. Here's how you could do so:

symbols = ["`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+",
           "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"]

Start by creating a dictionary from the symbols using dict.fromkeys setting a single space as value for each entry and create a translation table from the dictionary:

d = dict.fromkeys(''.join(symbols), ' ')
# {'`': ' ', ',': ' ', '~': ' ', '!': ' ', '@': ' '...
t = str.maketrans(d)

Then call the string translate method to map the characters in the above dictionary with an empty space:

s = '~this@is!a^test@'
s.translate(t)
# ' this is a test '

Upvotes: 8

Olvin Roght

Reputation: 7812

After launching some tests, I can say that str.translate() is the best variant.

Input data:

symbols = {"`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"}
translate_table = {126: None, 93: None, 91: None, 125: None, 92: None, 42: None, 45: None, 94: None, 62: None, 47: None, 35: None, 59: None, 44: None, 58: None, 60: None, 124: None, 61: None, 36: None, 95: None, 43: None, 96: None, 123: None, 64: None, 33: None, 38: None, 63: None, 46: None, 34: None, 41: None, 37: None, 40: None}
regular_expression = "[`~!@#$%^&*()_\-+={[\]}|\\:;\"<,>.?/]"
small_document = "Some**r@an]]\"dom t##xt"
normal_document = "TbsX^Kt$FZ%haZe+sLxu:Al\"xNAL\\Kix[mHp_gn]PrG`DqGd~GdNc;BoEq.SYD?Rp>ukq,UfO<XdTc=RUH}oifc&oP!CB*me@Qv{Qf-Li)gmXL/IQH#mne(Khaj|"
big_document = "QOfY+dymyoGBAxTAoIeM+jEWlaECUZEUXuMvprJOqFtQR*OiHtTFZkUNbYipSTTDPOVkIdGTcjWrQmbmthKBHBSEOZ)lQAIJOrVgmGGFdtqbuFfj<Dls<JWtKczAFMPYMemiJBJHdPeeul\\x>lGIBvUsxBokagvVovrrdxdKMtAKx>MEexYv>DGqPUXYaBQKwiSIUobrPQYjilhHMQunE;RiqOZPTnyOEgRrpxcuobvvmGkFpTqgMxYYhrmRRnauiqgvCmZ\"UauceaXsgAMSakxewzPrlIrYkVCVZaEGh]qiizYyzbkcHPF@qQsQMfHPDEbEnWtrCFoARUYAloOcctqmL@hegZbfhsHaJOxOxzQhZAVjVDgokosATfhKMT!WYyPWKcKAHKCzQGGJOCglYGZbftsuyntXZUKNqgGlsLJqgN,pUcOoA/tStXFXgpoSErgvw/OUMPWjJwt=bhMAIDayOZXJm=ifYYUuAvSIZjwnBfktNvEvZmvQso%HiNZEVqoDR%nQBtCkhjSfVfDuRSRsvp-sCunjDDUYSEVLICQdisxhEfqkUTkiPlLiUNNwrvO#WTDmweZyMeIbgNXkIsvaJeHYXV(HvRcGNZM(PPRIAyyLWivGiqMVBtwObqLfEEISyyjGNEdUU:ys`dXcVawkIEAjFXky`RUXNTm`LDM}mwTOcmsSo}haJXPnkwOhKLYwve}SWifzKq}grw}fMSQXXWguUQtlWpPZQymR^wBKEyolFlZnzEEmehSNenOqDOHWRit[Npm?R?DIPXAmQYYBbmJofxUzzWBsVCoPI?VmpXhoMxCfXyHEHowXzIJvExThiffLhBTtma_jk_NrbkPCGGypXvOuBqBxDYfC{bwIHoaqnJSKytxwWXBNnKG~PKuQklGblEwH~rJoGpKZmm~tTEFnPLdmzfrqJibMYIykzL$RZLPmsZjB$AAbZwFnByOydEOIfFvTaEQaSjbpeBZuUGY&ZfPQgLihmPYrhZxSwMzLrNF.WjFiDCLyXksdkLeMHVCfrdgCAotElQ|"
no_match_document = "XOtasggWqhtSLJpHEGoCmMRepFBlRfAGKTLPcEtKonFVsPgvWgAbvJVeMWILPgLapwAmTgXWVbxOJtUFmMygzIqYPqyAxzwElTFyYcGdtnNa"

Code:

def func1(doc):
    for c in symbols:
        doc = doc.replace(c, "")
    return doc


def func2(doc):
    return doc.translate(translate_table)


def func3(doc):
    return re.sub(regular_expression, "", doc)


def func4(doc):
    return "".join(c for c in doc if c not in symbols)

Test results:

func1(small_document):      0.701037002
func1(normal_document):     1.1260866900000002
func1(big_document):        3.4234831459999997
func1(no_match_document):   0.7740780450000004

func2(small_document):      0.14135037500000003
func2(normal_document):     0.5368806810000004
func2(big_document):        0.8128472860000002
func2(no_match_document):   0.394245089

func3(small_document):      0.3157141610000007
func3(normal_document):     0.927359323000001
func3(big_document):        1.9310377590000005
func3(no_match_document):   0.18656399199999996

func4(small_document):      0.3034549070000008
func4(normal_document):     1.3695875739999988
func4(big_document):        10.115730064
func4(no_match_document):   1.2086623230000022

UPD.

Input data I've provided have been "prepared" specially for pure method testing.

To generate translate_table I've used next dict comprehension:

translate_table = {ord(s): None for s in symbols}

Here is link to website for regex validation (it could be helpful).

In case if you want to recalculate tests by yourself, here is code:

    if __name__ == '__main__':
    import timeit
    print("func1(small_document)", timeit.timeit("func1(small_document)", setup="from __main__ import func1, small_document", number=100000))
    print("func1(normal_document): ", timeit.timeit("func1(normal_document)", setup="from __main__ import func1, normal_document", number=100000))
    print("func1(big_document): ", timeit.timeit("func1(big_document)", setup="from __main__ import func1, big_document", number=100000))
    print("func1(no_match_document): ", timeit.timeit("func1(no_match_document)", setup="from __main__ import func1, no_match_document", number=100000))

    print("func2(small_document): ", timeit.timeit("func2(small_document)", setup="from __main__ import func2, small_document", number=100000))
    print("func2(normal_document): ", timeit.timeit("func2(normal_document)", setup="from __main__ import func2, normal_document", number=100000))
    print("func2(big_document): ", timeit.timeit("func2(big_document)", setup="from __main__ import func2, big_document", number=100000))
    print("func2(no_match_document): ", timeit.timeit("func2(no_match_document)", setup="from __main__ import func2, no_match_document", number=100000))

    print("func3(small_document): ", timeit.timeit("func3(small_document)", setup="from __main__ import func3, small_document", number=100000))
    print("func3(normal_document): ", timeit.timeit("func3(normal_document)", setup="from __main__ import func3, normal_document", number=100000))
    print("func3(big_document): ", timeit.timeit("func3(big_document)", setup="from __main__ import func3, big_document", number=100000))
    print("func3(no_match_document): ", timeit.timeit("func3(no_match_document)", setup="from __main__ import func3, no_match_document", number=100000))

    print("func4(small_document): ", timeit.timeit("func4(small_document)", setup="from __main__ import func4, small_document", number=100000))
    print("func4(normal_document): ", timeit.timeit("func4(normal_document)", setup="from __main__ import func4, normal_document", number=100000))
    print("func4(big_document): ", timeit.timeit("func4(big_document)", setup="from __main__ import func4, big_document", number=100000))
    print("func4(no_match_document): ", timeit.timeit("func4(no_match_document)", setup="from __main__ import func4, no_match_document", number=100000))

Upvotes: 5

Ralf

Reputation: 16505

My code replaces symbols with spaces and does NOT remove those spaces.

For short strings .join() is fast, but for larger strings .translate() is faster if there is a lot to replace. Surprisingly, .replace() is still very fast if there are few replacements to be made.

text: '(Hello World)] *!'
using_replace                     0.046
using_join                        0.016
using_translate                   0.031

text: '~this@is!a^test@'
using_replace                     0.046
using_join                        0.017
using_translate                   0.029

text: '~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@~/()&this@isasd!&=)(/as/dw&%#a^test@'
using_replace                     0.195
using_join                        2.327
using_translate                   0.061

text: 'a long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replacea long text without chars to replace'
using_replace                     0.051
using_join                        2.100
using_translate                   0.064

Comparing some strategies:

def using_replace(text, symbols_to_replace, replacement=' '):
    for char in symbols_to_replace:
        text = text.replace(char, replacement)

    return text

def using_join(text, symbols_to_replace, replacement=' '):
    return ''.join(
        replacement if char in symbols_to_replace else char
        for char in text)

def using_translate(text, symbols_to_replace, replacement=' '):
    translation_dict = str.maketrans(
        dict.fromkeys(symbols_to_replace, replacement))

    return text.translate(translation_dict)

with this timeit code for different texts:

    # a 'set' for faster lookup
    symbols = {
        '`', '~', '!', '@', '#', '$', '%', '^', '&', '*',
        '(', ')', '_', '-', '+', '=', '{', '[', ']', '}',
        '|', '/', ':', ';', '"', '<', ',', '>', '.', '?',
        '\\',
    }

    text_list = [
        '(Hello World)] *!',
        '~this@is!a^test@',
        '~/()&this@isasd!&=)(/as/dw&%#a^test@' * 1000,
        'a long text without chars to replace' * 1000,
    ]
    for s in text_list:
        assert (
                using_replace(s, symbols)
                == using_join(s, symbols)
                == using_translate(s, symbols))

    for s in text_list:
        print()
        print('text:', repr(s))
        for func in [using_replace, using_join, using_translate]:
            t = timeit.timeit(
                'func(s, symbols)',
                'from __main__ import func, s, symbols',
                number=10000)
            print('{:30s} {:8.3f}'.format(func.__name__, t))

Upvotes: 1

vurmux

Reputation: 10020

s = '''
def translate_():
    symbols = '`,~,!,@,#,$,%,^,&,*,(,),_,-,+,=,{,[,],},|,\,:,;,",<,,,>,.,?,/'
    s = '~this@is!a^test @'
    t = str.maketrans(dict.fromkeys(symbols, ' '))
    s.translate(t)
    return s

def replace_():
    symbols = '`,~,!,@,#,$,%,^,&,*,(,),_,-,+,=,{,[,],},|,\,:,;,",<,,,>,.,?,/'
    s = '~this@is!a^test @'
    for symbol in symbols:
        s = s.replace(symbol, ' ')
    return s
'''

print(timeit.timeit('replace_()', setup=s, number=100000))
print(timeit.timeit('translate_()', setup=s, number=100000))

Will print:

0.7663131961598992

0.4139239452779293

So replacing with translate is nearly 2 times faster than using several replaces.

Upvotes: 1

Replace multiple (special) characters - most efficient way?

Answers (6)

A bit of complexity theory

Related Questions