Reputation: 4411
Today I was trying to find a method for doing some processing on strings in Python. A programmer more senior than I am said not to use += but to use ''.join(). I could also read this in, e.g.: http://wiki.python.org/moin/PythonSpeed/#Use_the_best_algorithms_and_fastest_tools. But I tested this myself and found somewhat strange results (it's not that I'm trying to second-guess them, but I do want to understand).
The idea was: given the string "This is \"an example text\" containing spaces", it should be converted to Thisis"an example text"containingspaces. The spaces are removed, but only outside the quotes.
I measured the performance of two different versions of my algorithm, one using ''.join(list) and one using +=.
import time

#uses '+=' operator
def strip_spaces(s):
    ret_val = ""
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found
        if i == ' ' and quote_found == True:
            ret_val += i
        if i != ' ':
            ret_val += i
    return ret_val

#uses "".join()
def strip_spaces_join(s):
    #ret_val = ""
    ret_val = []
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found
        if i == ' ' and quote_found == True:
            #ret_val = ''.join((ret_val, i))
            ret_val.append(i)
        if i != ' ':
            #ret_val = ''.join((ret_val, i))
            ret_val.append(i)
    return ''.join(ret_val)

def time_function(function, data):
    time1 = time.time()
    function(data)
    time2 = time.time()
    print "it took about {0} seconds".format(time2 - time1)
I timed both versions with timeit:

print '#using += yields ', timeit.timeit('f(string)', 'from __main__ import string, strip_spaces as f', number=1000)
print '#using \'\'.join() yields ', timeit.timeit('f(string)', 'from __main__ import string, strip_spaces_join as f', number=1000)

On my machine this yielded the following output, with only a minor advantage for the ''.join() version:

#using += yields  0.0130770206451
#using ''.join() yields  0.0108470916748

The difference is really minor. Why is ''.join() not clearly outperforming the function that uses +=? There only seems to be a small advantage for the ''.join() version.
I tested this on Ubuntu 12.04 with python-2.7.3
Upvotes: 3
Views: 2955
Reputation: 71495
(This may be lots of detail the OP is already aware of, but going through the issue in full could help others who end up on this question)
The problem with mystring += suffix is that strings are immutable, so this is actually equivalent to mystring = mystring + suffix. So the implementation has to create a new string object, copy all of the characters from mystring over to it, and then copy all the characters from suffix after that. Then the name mystring is rebound to refer to the new string; the original string object that was referred to by mystring is untouched.
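A quick way to see the rebinding in action (a minimal sketch; output annotated in comments):

s = 'abc'
t = s          # t refers to the same string object as s
s += 'def'     # creates a new string and rebinds the name s
print s        # abcdef
print t        # abc -- the original object is untouched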
On its own, that's not actually a problem. Any method of concatenating these two strings has to do this, including ''.join([mystring, suffix]); this is actually worse, because it has to first construct a list object and then iterate over it, and while there's no actual data transfer in splicing the empty string between mystring and suffix, that'll take at least one instruction to sort out.
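You can verify this for a single pair of strings with a minimal sketch (the string sizes are arbitrary and the exact timings are machine-dependent; the point is only that join carries extra overhead for two operands):

import timeit

setup = "a = 'x' * 100; b = 'y' * 100"
print timeit.timeit('a + b', setup)            # plain concatenation of one pair
print timeit.timeit("''.join([a, b])", setup)  # has to build and walk a list first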
Where += becomes a problem is when you do it repeatedly. Something like this:

mystring = ''
for c in 'abcdefg' * 1000000:
    mystring += c
Remember that mystring += c is equivalent to mystring = mystring + c. So on the first iteration of the loop it evaluates '' + 'a', copying 1 character total. Next it does 'a' + 'b', copying 2 characters total. Then 'ab' + 'c' for 3 characters, then 'abc' + 'd' for 4, and I think you can see where this is going. Every subsequent += is repeating all of the work of the previous one, and then copying the new string as well. This gets extremely wasteful.
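To put a number on "extremely wasteful": building a string of n characters one character at a time copies 1 + 2 + ... + n = n*(n+1)/2 characters in total. For the loop above:

n = 7 * 1000000            # len('abcdefg' * 1000000)
print n * (n + 1) // 2     # 24500003500000, i.e. ~2.45e13 character copies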
''.join(...) is better because there you wait until you know all of the strings before copying any of them, and then copy each one directly to its correct place in the final string object. Contrary to what some of the comments and answers have said, this remains the case even if you have to modify your loop to append the strings to a list of strings, and then join them after the loop. Lists are not immutable, so appending to a list modifies it in place, and it also only needs to append a single reference rather than copying all the characters in a string. Doing thousands of appends to a list is much faster than doing thousands of string += operations.
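That is, the recommended shape of the loop above becomes (a minimal sketch):

parts = []
for c in 'abcdefg' * 1000000:
    parts.append(c)         # stores one reference; copies no characters
mystring = ''.join(parts)   # every character is copied exactly once, here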
Repeated string += is theoretically a problem even without loops, if you just write your source code something like:

s = 'foo'
s += 'bar'
s += 'baz'
...

But in practice you're fairly unlikely to write a long enough sequence of code like that by hand, unless the strings involved are really huge. So just watch out for += in loops (or recursive functions).
The reason why you may not see this result when you try to time it is that there is actually an optimization for string += in the CPython interpreter. Let's go back to my silly example loop:

mystring = ''
for c in 'abcdefg' * 1000000:
    mystring += c
Every time this does mystring = mystring + c, the old value of mystring becomes garbage and is deleted, and the name mystring ends up referring to a newly created string that begins exactly with the contents of the old object. We could optimise this by recognising that mystring is about to become garbage, so we can do whatever we like to it without anybody caring. So even though strings are immutable at the Python level, at the implementation level we'll make them dynamically expandable, and we'll implement target += source either by doing the normal allocate-a-new-string-and-copy method, or by expanding the target string and copying in only the source characters, depending on whether target would be made garbage.
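On CPython you can watch this happen by checking the object identity around a += (a sketch only; this relies on an implementation detail, so the first result is "usually" rather than "always" True):

s = 'x' * 1000
old_id = id(s)
s += 'y'                 # s holds the only reference: CPython can resize in place
print id(s) == old_id    # usually True on CPython 2.7

t = 'x' * 1000
extra_ref = t            # a second reference makes the target non-garbage
old_id = id(t)
t += 'y'
print id(t) == old_id    # False: a brand new string had to be allocated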
The problem with this optimization is that it's easy to disrupt. It works absolutely fine on small self-contained loops (which are the easiest to convert to using join, by the way). But if you're doing something more complex and you ever accidentally end up with more than one reference to the strings, suddenly the code can run a lot slower.

Say you've got some logging calls in the loop, and the logging system buffers its messages for a while so as to print them all at once (should be safe; strings are immutable). The references to your strings within the logging system could stop the += optimization being applicable.
Say you've written your loop as a recursive function (which Python doesn't really like anyway, but still) building up a string with += for some reason. The outer stack frames will still have references to the old values.
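A hypothetical sketch of that situation (the function and argument names are made up for illustration):

def build(s, chars):
    if not chars:
        return s
    # every caller's frame still holds a reference to its own s, so the
    # string bound to s here is never garbage and += cannot resize in place
    s += chars[0]
    return build(s, chars[1:])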
Or maybe what you're doing with the strings is generating a series of objects, so you're passing them to a class; if the class stores the string directly in the instance, the optimization goes away, but if the class manipulates them first then the optimization still works.
Essentially, the performance of what looks like a really basic primitive operation is either okay or really bad, and which it is depends on code other than the code using +=. In extreme cases, a change to a completely separate file (maybe even a third-party package) could introduce a massive performance regression in one of your modules that hasn't changed in a long time!
Plus, my understanding is that the += optimization is only easy to implement in CPython, because it makes use of reference counting; you can easily tell when the target string is garbage by looking at its reference count, whereas with more sophisticated garbage collection you can't tell until you remove the reference and wait for the garbage collector to run; far too late to make a decision about how to implement +=. So again, really simple, basic code that shouldn't have any issues being portable between Python implementations can suddenly run too slowly to be useful when you move it to another implementation.
And here's some benchmarking to show the scale of the problem:
import timeit

def plus_equals(data):
    s = ''
    for c in data:
        s += c

def simple_join(data):
    s = ''.join(data)

def append_join(data):
    l = []
    for c in data:
        l.append(c)
    s = ''.join(l)

def plus_equals_non_garbage(data):
    s = ''
    for c in data:
        dummy = s
        s += c

def plus_equals_maybe_non_garbage(data):
    s = ''
    for i, c in enumerate(data):
        if i % 1000 == 0:
            dummy = s
        s += c

def plus_equals_enumerate(data):
    s = ''
    for i, c in enumerate(data):
        if i % 1000 == -1:
            dummy = s
        s += c

data = ['abcdefg'] * 1000000

for f in (
    plus_equals,
    simple_join,
    append_join,
    plus_equals_non_garbage,
    plus_equals_maybe_non_garbage,
    plus_equals_enumerate,
):
    print '{:30}{:20.15f}'.format(f.__name__, timeit.timeit(
        'm.{0.__name__}(m.data)'.format(f),
        setup='import __main__ as m',
        number=1
    ))
On my system this prints:
plus_equals 0.066924095153809
simple_join 0.013648986816406
append_join 0.086287975311279
plus_equals_non_garbage 540.663727998733521
plus_equals_maybe_non_garbage 0.731688976287842
plus_equals_enumerate 0.156824111938477
The optimization of += works very well when it works (even beating the dumb append_join version by a little). My numbers suggest you might be able to optimize code by replacing append + join with += in some cases, but the benefit is not worth the risk of some other future change accidentally triggering the blowout (and it would most likely be vanishingly small if there was any other actual work going on in the loop; and if there isn't, then you should be using something like the simple_join version).
By comparing plus_equals_maybe_non_garbage to plus_equals_enumerate, you can see that even if the optimization only fails on one in every thousand += operations, there's still a 5-fold loss of performance.
The optimization of += is really only intended as a crutch to save people who aren't experienced Python programmers, or who are just quickly and lazily writing some scratch code. If you're thinking about what you're doing, you should be using join.
Summary: Using += is fine for a fixed small number of concatenations. join is always better for using loops to build up strings. In practice you may not see a huge improvement porting code from += to join, due to the += optimization. You should still use join anyway, because the optimization is unreliable and the difference when it fails to kick in can be enormous.
Upvotes: 3
Reputation: 85482
The difference in performance between += and .join depends on a lot of factors:

The operating system. Running this for increasingly larger strings on Unix-like or Windows systems can give quite different results. Typically, you will see a much more pronounced increase in run times under Windows.

The Python implementation. By default we talk about CPython, but there are other implementations, such as Jython or PyPy. Let's have a look at PyPy, using the source code from the answer further down this page:
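(concat.py itself isn't shown in the post; presumably it is a small harness around those two functions, something like the sketch below. The input data and repeat count here are guesses, so your absolute numbers will differ.)

# concat.py -- hypothetical reconstruction of the benchmark harness
import timeit

def inplace_add_concatenation(s):
    res = ''
    for c in s:
        res += c

def str_join_concatenation(s):
    ''.join(s)

s = list('This is "an example text" containing spaces' * 1000)
t_add = timeit.timeit('f(s)',
    'from __main__ import s, inplace_add_concatenation as f', number=100)
t_join = timeit.timeit('f(s)',
    'from __main__ import s, str_join_concatenation as f', number=100)
print 'inplace_add_concatenation:', t_add
print 'str_join_concatenation:', t_join
print 'ratio:', t_add / t_join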
CPython 2.7:
python concat.py
inplace_add_concatenation: 0.420897960663
str_join_concatenation: 0.061793088913
ratio: 6.81140833169
PyPy 1.9:
pypy concat.py
inplace_add_concatenation: 1.26573014259
str_join_concatenation: 0.0392870903015
ratio: 32.2174570038
Even though PyPy is famous for its speed-ups compared to CPython, it is slower for the += version. It was a deliberate decision not to include the += optimization in the default version of PyPy.
A rule of thumb for dealing with performance: "Never guess, always measure." Reading the docs also helps:
"CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations."
from http://docs.python.org/2/library/stdtypes.html#typesseq
Upvotes: 1
Reputation: 375585
Another option is to write a function which joins using a generator, rather than appending to a list each time.
For example:
def strip_spaces_gen(s):
    quote_found = False
    for i in s:
        if i == '"':
            quote_found = not quote_found
        if i == ' ' and quote_found == True:
            # Note: you (c|sh)ould drop the == True, but I'll leave it here
            # so as to not give an unfair advantage over the other functions
            yield i
        if i != ' ':
            yield i

def strip_spaces_join_gen(ing):
    return ''.join(strip_spaces_gen(ing))
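Usage matches the other variants; a quick sanity check with the string from the question:

s = 'This is "an example text" containing spaces'
print strip_spaces_join_gen(s)   # Thisis"an example text"containingspaces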
This appears to be about the same (as a join) for a shorter string:
In [20]: s = "This is \"an example text\" containing spaces"
In [21]: %timeit strip_spaces_join_gen(s)
10000 loops, best of 3: 22 us per loop
In [22]: %timeit strip_spaces(s)
100000 loops, best of 3: 13.8 us per loop
In [23]: %timeit strip_spaces_join(s)
10000 loops, best of 3: 23.1 us per loop
But it is faster for larger strings:
In [24]: s = s * 1000
In [25]: %timeit strip_spaces_join_gen(s)
100 loops, best of 3: 12.9 ms per loop
In [26]: %timeit strip_spaces(s)
100 loops, best of 3: 17.1 ms per loop
In [27]: %timeit strip_spaces_join(s)
100 loops, best of 3: 17.5 ms per loop
Upvotes: 3
Reputation: 1122392
Do use the correct methodology when comparing algorithms; use the timeit module to eliminate fluctuations in CPU utilization and swapping. Using timeit shows there is little difference between the two approaches, but ''.join() is slightly faster:
>>> s = 1000 * string
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces as f', number=100)
1.3209099769592285
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces_join as f', number=100)
1.2893600463867188
>>> s = 10000 * string
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces as f', number=100)
14.545105934143066
>>> timeit.timeit('f(s)', 'from __main__ import s, strip_spaces_join as f', number=100)
14.43651008605957
Most of the work in your functions is the looping over each and every character and the testing for quotes and spaces, not the string concatenation itself. Moreover, the ''.join() variant does more work: you are appending the elements to a list first (this replaces the += string concatenation operations), then you are concatenating these values at the end using ''.join(). And that method is still ever so slightly faster.
You may want to strip back the work being done to compare just the concatenation part:
def inplace_add_concatenation(s):
    res = ''
    for c in s:
        res += c

def str_join_concatenation(s):
    ''.join(s)
which shows:
>>> s = list(1000 * string)
>>> timeit.timeit('f(s)', 'from __main__ import s, inplace_add_concatenation as f', number=1000)
6.113742113113403
>>> timeit.timeit('f(s)', 'from __main__ import s, str_join_concatenation as f', number=1000)
0.6616439819335938
This shows that ''.join() concatenation is still a heck of a lot faster than +=. The speed difference lies in the loop; s is a list in both cases, but ''.join() loops over the values in C, while the other version has to do all its looping in Python. And that makes all the difference here.
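For intuition, here is a rough pure-Python model of the two-pass strategy ''.join() uses internally (a sketch only; the real thing is C code inside CPython): measure the total length first, allocate the result once, then copy each piece exactly once.

def join_model(parts):
    total = sum(len(p) for p in parts)   # first pass: compute the final size
    buf = bytearray(total)               # a single allocation for the result
    pos = 0
    for p in parts:                      # second pass: copy each piece once
        buf[pos:pos + len(p)] = p
        pos += len(p)
    return str(buf)

The loop in this model still runs in Python, of course; the point is the single allocation and the single copy per character, which the real C implementation combines with a C-speed loop.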
Upvotes: 8