Reputation: 485
Can this little routine be made faster? With the elifs, a comprehension gets out of hand, but maybe I haven't tried it the right way.
import unicodedata

def cleanup(s):
    strng = ''
    good = ['\t', '\r', '\n']
    for char in s:
        if unicodedata.category(char)[0] != "C":
            strng += char
        elif char in good:
            strng += char
        elif char not in good:
            strng += ' '
    return strng
Upvotes: 1
Views: 126
Reputation: 569
If I understand your task correctly, you want to replace all Unicode control characters with spaces, except \t, \n and \r.
Here's how to do this more efficiently with regular expressions instead of loops.
import re

# make a string of all unicode control characters
# EXCEPT \t - chr(9), \n - chr(10) and \r - chr(13)
# (Python 2: unichr() and list-returning range())
control_chars = ''.join(map(unichr, range(0, 9) +
                                    range(11, 13) +
                                    range(14, 32) +
                                    range(127, 160)))

# build your regular expression
cc_regex = re.compile('[%s]' % re.escape(control_chars))

def cleanup(s):
    # substitute all control characters in the regex
    # with spaces and return the new string
    return cc_regex.sub(' ', s)
You can control which characters to include or exclude by adjusting the ranges that make up the control_chars variable. Refer to the List of Unicode characters.
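For anyone on Python 3, where unichr is gone and range objects can no longer be joined with +, the same idea could be sketched roughly as below. The names are carried over from the code above; the use of itertools.chain is my own substitution, not part of the original answer. Note that these ranges cover only the control characters proper (category Cc), so a format character such as U+200B (category Cf), which the question's loop would replace, is left alone here.

```python
import re
from itertools import chain

# Code points of the C0/C1 control characters and DEL,
# skipping \t (9), \n (10) and \r (13).
control_chars = ''.join(
    map(chr, chain(range(0, 9), range(11, 13), range(14, 32), range(127, 160)))
)

# One character class that matches any of the characters above.
cc_regex = re.compile('[%s]' % re.escape(control_chars))

def cleanup(s):
    # Replace every matched control character with a single space.
    return cc_regex.sub(' ', s)
```

Calling cleanup('a\x00b\tc') should give back the string with the NUL turned into a space and the tab kept.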
EDIT: Timing results.
Just out of curiosity, I ran some timing tests to see which of the three current methods is fastest.
I made three functions: cleanup_op(s), a copy of the OP's code; cleanup_loop(s), which is Cristian Ciupitu's answer; and cleanup_regex(s), which is my code.
Here's what I ran:
from timeit import default_timer as timer
sample = u"this is a string with some characters and \n new lines and \t tabs and \v and other stuff"*1000
start = timer();cleanup_op(sample);end = timer();print end-start
start = timer();cleanup_loop(sample);end = timer();print end-start
start = timer();cleanup_regex(sample);end = timer();print end-start
The results:
cleanup_op finished in about 1.1 seconds
cleanup_loop finished in about 0.02 seconds
cleanup_regex finished in about 0.004 seconds
So both answers are a significant improvement over the original code. I think @CristianCiupitu's answer is more elegant and Pythonic, while the regex is still faster.
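If you want to repeat the comparison, a single start/end measurement can be noisy, so taking the best of several runs with timeit.repeat may give steadier numbers. The following is only a sketch for Python 3: best_time is a hypothetical helper name, cleanup_op and cleanup_regex are simplified stand-ins for the functions timed above (the regex is written with explicit ranges), and only two of the three functions are included. The absolute numbers will differ from the ones reported here.

```python
import re
import timeit
import unicodedata

GOOD = '\t\r\n'
# Control characters (category Cc) except \t, \n and \r, as explicit ranges.
CC_REGEX = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]')

def cleanup_op(s):
    # Character-by-character version, equivalent to the question's loop.
    out = ''
    for char in s:
        if unicodedata.category(char)[0] != 'C' or char in GOOD:
            out += char
        else:
            out += ' '
    return out

def cleanup_regex(s):
    # Single regex substitution over the whole string.
    return CC_REGEX.sub(' ', s)

def best_time(func, arg, repeat=5, number=10):
    # Hypothetical helper: smallest average per-call time over `repeat` runs.
    timings = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(timings) / number

sample = ("this is a string with some characters and \n new lines "
          "and \t tabs and \v and other stuff") * 1000

if __name__ == '__main__':
    for func in (cleanup_op, cleanup_regex):
        print(func.__name__, best_time(func, sample))
```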
Upvotes: 1
Reputation: 20930
If I understand correctly, you want to convert all the Unicode control characters to spaces, except the tab, carriage return and newline. You can use str.translate for this:
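import sys
import unicodedata

good = map(ord, '\t\r\n')
# map every code point in a "C*" category, except the good ones, to a space
TBL_CONTROL_TO_SPACE = {
    i: u' '
    for i in xrange(sys.maxunicode + 1)  # + 1 so sys.maxunicode itself is included
    if unicodedata.category(unichr(i))[0] == "C" and i not in good
}

def cleanup(s):
    # s must be a unicode string for a dict-based translation table
    return s.translate(TBL_CONTROL_TO_SPACE)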
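On Python 3 the same table-based approach might look like the sketch below, with xrange and unichr swapped for range and chr, and the "good" code points kept in a set (GOOD is my own name for it). Building the table walks every code point once, so it is a one-time cost at import; after that each call is a single translate. Unlike a regex limited to the control ranges, this follows the question's rule and replaces every character whose category starts with "C", including format characters such as U+200B.

```python
import sys
import unicodedata

# Code points that must be kept even though their category is "Cc".
GOOD = {ord(c) for c in '\t\r\n'}

# Translation table: every other code point in a "C*" category maps to a space.
TBL_CONTROL_TO_SPACE = {
    i: ' '
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i))[0] == 'C' and i not in GOOD
}

def cleanup(s):
    # str.translate looks up each character's code point in the table.
    return s.translate(TBL_CONTROL_TO_SPACE)
```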
Upvotes: 0