Kenobi

Reputation: 485

Making a python loop faster

Can this little routine be made faster? With the elifs, a comprehension gets out of hand, but maybe I haven't tried it the right way.

import unicodedata

def cleanup(s):
    strng = ''
    good = ['\t', '\r', '\n']
    for char in s:        
        if unicodedata.category(char)[0]!="C":
            strng += char
        elif char in good:
            strng += char
        elif char not in good:
            strng += ' '
    return strng

Upvotes: 1

Views: 126

Answers (2)

keda

Reputation: 569

If I understand your task correctly, you want to replace all unicode control characters with spaces except \t, \n and \r.

Here's how to do this more efficiently with regular expressions instead of loops.

import re

# make a string of all unicode control characters 
# EXCEPT \t - chr(9), \n - chr(10) and \r - chr(13)
control_chars = ''.join(map(unichr, range(0,9) + \
                            range(11,13) + \
                            range(14,32) + \
                            range(127,160)))

# build your regular expression
cc_regex = re.compile('[%s]' % re.escape(control_chars))

def cleanup(s):
    # substitute all control characters in the regex 
    # with spaces and return the new string
    return cc_regex.sub(' ', s)

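You can control which characters to include or exclude by changing the ranges that make up the control_chars variable. Note that these ranges cover only the control characters proper (category Cc), whereas the loop in the question replaces every character whose category starts with "C". Refer to the List of Unicode characters.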
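To illustrate adjusting the ranges, here is a minimal sketch written for Python 3 (so chr instead of unichr). The names make_cleanup and CC_CODE_POINTS are made up for this example and are not part of the code above; the sketch assumes that at least one control character is left to replace, since an empty character class would not compile.

```python
import re

# Code points in Unicode category Cc: C0 controls, DEL, and C1 controls.
CC_CODE_POINTS = list(range(0, 32)) + list(range(127, 160))

def make_cleanup(keep='\t\n\r'):
    """Return a function that replaces Cc characters not in `keep` with spaces.

    Hypothetical helper for illustration only.
    """
    chars = ''.join(chr(i) for i in CC_CODE_POINTS if chr(i) not in keep)
    pattern = re.compile('[%s]' % re.escape(chars))
    return lambda s: pattern.sub(' ', s)

cleanup = make_cleanup()                  # keeps \t, \n and \r
cleanup_strict = make_cleanup(keep='\n')  # keeps only \n

print(repr(cleanup('a\x00b\tc\x0bd\n')))
print(repr(cleanup_strict('a\x00b\tc\x0bd\n')))
```

Passing a different keep string changes which control characters survive without editing the ranges by hand.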

EDIT: Timing results.

Just out of curiosity I ran some timing tests to see which of the three current methods is fastest.

I made three functions: cleanup_op(s), a copy of the OP's code; cleanup_loop(s), which is Cristian Ciupitu's answer; and cleanup_regex(s), which is my code.

Here's what I ran:

from timeit import default_timer as timer

sample = u"this is a string with some characters and \n new lines and \t tabs and \v and other stuff"*1000

start = timer();cleanup_op(sample);end = timer();print end-start
start = timer();cleanup_loop(sample);end = timer();print end-start
start = timer();cleanup_regex(sample);end = timer();print end-start

The results:

cleanup_op finished in about 1.1 seconds

cleanup_loop finished in about 0.02 seconds

cleanup_regex finished in about 0.004 seconds

So either answer is a significant improvement over the original code. I think @CristianCiupitu's answer is more elegant and Pythonic, while the regex is still faster.
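For anyone who wants to repeat the comparison, here is a rough Python 3 sketch that uses timeit.repeat instead of single timer calls, which tends to give steadier numbers. The three functions are my own condensed re-creations of the approaches, not the exact code that was timed above, and the absolute timings will depend on the machine and Python version.

```python
import re
import sys
import timeit
import unicodedata

KEEP = '\t\r\n'

def cleanup_op(s):
    # Character-by-character loop, equivalent to the question's code.
    out = ''
    for ch in s:
        if unicodedata.category(ch)[0] != 'C' or ch in KEEP:
            out += ch
        else:
            out += ' '
    return out

# Translation table in the style of the translate answer.
TABLE = {
    i: ' '
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i))[0] == 'C' and chr(i) not in KEEP
}

def cleanup_loop(s):
    return s.translate(TABLE)

# Character class in the style of the regex answer (category Cc only).
CC = ''.join(chr(i) for i in list(range(0, 32)) + list(range(127, 160))
             if chr(i) not in KEEP)
CC_REGEX = re.compile('[%s]' % re.escape(CC))

def cleanup_regex(s):
    return CC_REGEX.sub(' ', s)

sample = ("this is a string with some characters and \n new lines "
          "and \t tabs and \v and other stuff") * 1000

# The three functions should agree on this sample before speed is compared.
assert cleanup_op(sample) == cleanup_loop(sample) == cleanup_regex(sample)

for func in (cleanup_op, cleanup_loop, cleanup_regex):
    # Best of 3 runs, 3 calls per run, reported per call.
    best = min(timeit.repeat(lambda: func(sample), number=3, repeat=3)) / 3
    print('%-14s %.6f s per call' % (func.__name__, best))
```

The sample only contains \v as an unwanted character, so it does not exercise the difference between category Cc and the other "C" categories.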

Upvotes: 1

Cristian Ciupitu

Reputation: 20930

If I understand correctly, you want to convert all the Unicode control characters to spaces, except the tab, carriage return and newline. You can use unicode.translate (str.translate in Python 3) for this:

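import sys
import unicodedata

good = map(ord, '\t\r\n')
# Map every code point in a "C*" category, except those in good, to a space.
# The + 1 makes the range include sys.maxunicode itself.
TBL_CONTROL_TO_SPACE = {
    i: u' '
    for i in xrange(sys.maxunicode + 1)
    if unicodedata.category(unichr(i))[0] == "C" and i not in good
}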

def cleanup(s):
    return s.translate(TBL_CONTROL_TO_SPACE)
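Most of the code-point space is unassigned and therefore falls in a "C" category, so the eager table above ends up large. Below is a sketch of a lazy variant for Python 3, assuming you would rather fill the table on demand. It relies on str.translate accepting any object that supports indexing by code point; ControlToSpace is a hypothetical name introduced for this example.

```python
import unicodedata

KEEP = '\t\r\n'

class ControlToSpace(dict):
    """Translation table that is filled in lazily (hypothetical helper)."""

    def __missing__(self, code_point):
        ch = chr(code_point)
        if unicodedata.category(ch)[0] == 'C' and ch not in KEEP:
            result = ' '
        else:
            result = code_point  # map the character to itself
        self[code_point] = result  # cache so each code point is classified once
        return result

TBL_CONTROL_TO_SPACE = ControlToSpace()

def cleanup(s):
    return s.translate(TBL_CONTROL_TO_SPACE)

# U+0000 is category Cc and U+200B (zero width space) is category Cf.
print(repr(cleanup('a\x00b\tc\u200bd\n')))
```

The trade-off is a Python-level call the first time each distinct character is seen, in exchange for not classifying every code point up front.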

Upvotes: 0
