Reputation: 29527
This question has been asked before, but the fast answers that I have seen also remove the trailing spaces, which I don't want.
"  a  bc  "
should become
" a bc "
I have
text = re.sub(' +', " ", text)
but am hoping for something faster. The suggestion that I have seen (and which won't work) is
' '.join(text.split())
Note that I will be doing this to lots of smaller texts so just checking for a trailing space won't be so great.
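To make the difference between the two expressions concrete, here is a small sketch. The sample string is an assumption chosen for illustration, not taken from the question.

```python
import re

text = "  a   bc  "  # hypothetical sample with runs of spaces

# Regex from the question: each run of spaces becomes one space,
# so a single leading and trailing space survive.
collapsed = re.sub(' +', ' ', text)

# split()/join: also collapses runs, but drops leading/trailing whitespace.
stripped = ' '.join(text.split())

print(repr(collapsed))  # ' a bc '
print(repr(stripped))   # 'a bc'
```

Note that split() with no arguments splits on any whitespace (tabs and newlines too), while the regex above only touches the space character.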
Upvotes: 1
Views: 4571
Reputation: 77454
If you really want to optimize stuff like this, use C, not Python.
Try Cython, which is pretty much Python syntax but can be about as fast as C.
Here is some stuff you can time:
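# Python 2 code: the 'c' typecode no longer exists in Python 3
import array
buf=array.array('c')
input="  a  bc  "
space=False
for c in input:
    # skip a space only when the previous character was also a space
    if not space or not c == ' ': buf.append(c)
    space = (c == ' ')
buf.tostring()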
Also try using cStringIO:
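# Python 2 code: cStringIO was replaced by io.StringIO in Python 3
import cStringIO
buf=cStringIO.StringIO()
input="  a  bc  "
space=False
for c in input:
    # skip a space only when the previous character was also a space
    if not space or not c == ' ': buf.write(c)
    space = (c == ' ')
buf.getvalue()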
But again, if you want to make such things really fast, don't do it in Python; use Cython. The two approaches I gave here will likely be slower than re.sub, just because they put much more work on the Python interpreter. If you want these things to be fast, do as little as possible in Python. The for c in input loop likely already kills any theoretical performance advantage of the above approaches.
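For anyone timing this on Python 3, where cStringIO and the 'c' array typecode are gone, the same loop can be sketched with io.StringIO. The function name collapse_spaces_loop is made up for this example, and this is only a port of the idea above, not a recommendation that it will be fast.

```python
import io

def collapse_spaces_loop(text):
    """Collapse runs of spaces with an explicit character loop."""
    buf = io.StringIO()
    prev_space = False
    for c in text:
        # Write the character unless it is a space right after another space.
        if not (prev_space and c == ' '):
            buf.write(c)
        prev_space = (c == ' ')
    return buf.getvalue()

print(repr(collapse_spaces_loop("  a   bc  ")))  # ' a bc '
```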
Upvotes: 2
Reputation: 45644
FWIW, some timings:
$ python -m timeit -s 's=" a bc "' 't=s[:]' "while '  ' in t: t=t.replace('  ', ' ')"
1000000 loops, best of 3: 1.05 usec per loop
$ python -m timeit -s 'import re;s=" a bc "' "re.sub(' +', ' ', s)"
100000 loops, best of 3: 2.27 usec per loop
$ python -m timeit -s 's=" a bc "' "''.join((s[0],' '.join(s[1:-1].split()),s[-1]))"
1000000 loops, best of 3: 0.592 usec per loop
$ python -m timeit -s 'import re;s=" a bc "' "re.sub(' {2,}', ' ', s)"
100000 loops, best of 3: 2.34 usec per loop
$ python -m timeit -s 's=" a bc "' '" "+" ".join(s.split())+" "'
1000000 loops, best of 3: 0.387 usec per loop
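If you would rather rerun the comparison from a script than from the shell, something like the following should work. It is a rough sketch: the sample string and the function names are assumptions for illustration, and the numbers will differ by machine and Python version.

```python
import re
import timeit

s = "  a   bc  "  # hypothetical sample; substitute your own texts

def with_regex(t):
    return re.sub(' +', ' ', t)

def with_replace_loop(t):
    # Repeatedly halve runs of spaces until no double space is left.
    while '  ' in t:
        t = t.replace('  ', ' ')
    return t

def with_split_join(t):
    # Assumes t starts and ends with a space, like the last command above.
    return ' ' + ' '.join(t.split()) + ' '

CANDIDATES = (with_regex, with_replace_loop, with_split_join)

def time_all(number=10000):
    """Return the best-of-3 total time in seconds for each candidate."""
    results = {}
    for f in CANDIDATES:
        results[f.__name__] = min(
            timeit.repeat(lambda: f(s), repeat=3, number=number))
    return results

if __name__ == "__main__":
    # Check that all candidates agree before comparing their speed.
    assert len({f(s) for f in CANDIDATES}) == 1
    for name, seconds in time_all().items():
        print(name, seconds)
```

Calling through a lambda adds some overhead to every candidate equally, so treat the results as a relative ranking only.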
Upvotes: 3
Reputation: 21914
Just a small rewrite of the suggestion up there, but just because something has a small fault doesn't mean you should assume it won't work.
You could easily do something like:
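# Slices (x[:1], x[-1:]) avoid an IndexError on an empty string.
front_space = lambda x: x[:1] == " "
trailing_space = lambda x: x[-1:] == " "
# A bool multiplies as 0 or 1, so each edge space is added only if it was present.
" "*front_space(text) + ' '.join(text.split()) + " "*trailing_space(text)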
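One edge case to watch: for a text that is just " ", the expression above gives two spaces, because the same character counts as both the leading and the trailing space. A possible way to wrap the idea in a function that handles that is sketched below. The name collapse_keep_edges is made up here, and it assumes spaces are the only whitespace in your texts, since split() also splits on tabs and newlines.

```python
def collapse_keep_edges(text):
    """Collapse runs of spaces, keeping one leading/trailing space if present."""
    words = text.split()
    if not words:
        # Empty input stays empty; an all-space input becomes one space.
        return ' ' if text else ''
    lead = ' ' if text[0] == ' ' else ''
    trail = ' ' if text[-1] == ' ' else ''
    return lead + ' '.join(words) + trail

print(repr(collapse_keep_edges("  a   bc  ")))  # ' a bc '
```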
Upvotes: 0