String concatenation much faster in Python than Go

Question

I'm looking at using Go to write a small program that's mostly handling text. I'm pretty sure, based on what I've heard about Go and Python that Go will be substantially faster. I don't actually have a specific need for insane speeds, but I'd like to get to know Go.

The "Go is going to be faster" idea was supported by a trivial test:

# test.py
print("Hello world")

$ time python dummy.py
Hello world

real    0m0.029s
user    0m0.019s
sys 0m0.010s

// test.go
package main

import "fmt"

func main() {

    fmt.Println("hello world")
}

$ time ./test
hello world

real    0m0.001s
user    0m0.001s
sys 0m0.000s

Looks good in terms of raw startup speed (which is entirely expected). Highly non-scientific justification:

$ strace python test.py 2>&1 | wc -l
1223
$ strace ./test 2>&1 | wc -l
174

However, my next contrived test was how fast is Go when faffing with strings, and I was expecting to be similarly blown away by Go's raw speed. So, this was surprising:

# test2.py
s = ""

for i in range(1000000):
    s += "a"

$ time python test2.py
real    0m0.179s
user    0m0.145s
sys 0m0.013s

// test2.go
package main

func main() {

    s := ""

    for i:= 0; i < 1000000; i++ {
        s += "a";
    }
}

$ time ./test2
real    0m56.840s
user    1m50.836s
sys 0m17.653

So Go is hundreds of times slower than Python.

Now, I know this is probably due to Schlemiel the Painter's algorithm, which explains why the Go implementation is quadratic in i (i is 10 times bigger leads to 100 times slowdown).

However, the Python implementation seems much faster: 10 times more loops only slows it down by twice. The same effect persists if you concatenate str(i), so I doubt there's some kind of magical JIT optimization to s = 100000 * 'a' going on. And it's not much slower if I print(s) at the end, so the variable isn't being optimised out.

Naivety of the concatenation methods aside (there are surely more idiomatic ways in each language), is there something here that I have misunderstood, or is it simply easier in Go than in Python to run into cases where you have to deal with C/C++-style algorithmic issues when handling strings (in which case a straight Go port might not be as uh-may-zing as I might hope without having to, ya'know, think about things and do my homework)?

Or have I run into a case where Python happens to work well, but falls apart under more complex use?

Versions used: Python 3.8.2, Go 1.14.2

torek · Accepted Answer

TL;DR summary: basically you're testing the two implementation's allocators / garbage collectors and heavily weighting the scale on the Python side (by chance, as it were, but this is something the Python folks optimized at some point).

To expand my comments into a real answer:

Both Go and Python have counted strings, i.e., strings are implemented as a two-element header thingy containing a length (byte count or, for Python 3 strings, Unicode characters count) and data pointer.
Both Go and Python are garbage-collected (GCed) languages. That is, in both languages, you can allocate memory without having to worry about freeing it yourself: the system takes care of that automatically.
But the underlying implementations differ, quite a bit in this particular one important way: the version of Python you are using has a reference counting GC. The Go system you are using does not.

With a reference count, the inner bits of the Python string handler can do this. I'll express it as Go (or at least pseudo-Go) although the actual Python implementation is in C and I have not made all the details line up properly:

// add (append) new string t to existing string s
func add_to_string(s, t string_header) string_header {
    need = s.len + t.len
    if s.refcount == 1 { // can modify string in-place
        data = s.data
        if cap(data) >= need {
            copy_into(data + s.len, t.data, t.len)
            return s
        }
    }
    // s is shared or s.cap < need

    new_s := make_new_string(roundup(need))
    // important: new_s has extra space for the next call to add_to_string

    copy_into(new_s.data, s.data, s.len)
    copy_into(new_s.data + s.len, t.data, t.len)
    s.refcount--
    if s.refcount == 0 {
        gc_release_string(s)
    }
    return new_s
}

By over-allocating—rounding up the need value so that cap(new_s) is large—we get about log₂(n) calls to the allocator, where n is the number of times you do s += "a". With n being 1000000 (one million), that's about 20 times that we actually have to invoke the make_new_string function and release (for gc purposes because the collector uses refcounts as a first pass) the old string s.

[Edit: your source archaeology led to commit 2c9c7a5f33d, which suggests less than doubling but still a multiplicative increase. To other readers, see comment.]

The current Go implementation allocates strings without a separate capacity header field (see reflect.StringHeader and note the big caveat that says "don't depend on this, it might be different in future implementations"). Between the lack of a refcount—we can't tell in the runtime routine that adds two strings, that the target has only one reference—and the inability to observe the equivalent of cap(s) (or cap(s.data)), the Go runtime has to create a new string every time. That's one million memory allocations.

To show that the Python code really does use the refcount, take your original Python:

s = ""

for i in range(1000000):
    s += "a"

and add a second variable t like this:

s = ""
t = s

for i in range(1000000):
    s += "a"
    t = s

The difference in execution time is impressive:

$ time python test2.py
        0.68 real         0.65 user         0.03 sys
$ time python test3.py
       34.60 real        34.08 user         0.51 sys

The modified Python program still beats Go (1.13.5) on this same system:

$ time ./test2
       67.32 real       103.27 user        13.60 sys

and I have not poked any further into the details, but I suspect the Go GC is running more aggressively than the Python one. The Go GC is very different internally, requiring write barriers and occasional "stop the world" behavior (of all goroutines that are not doing the GC work). The refcounting nature of the Python GC allows it to never stop: even with a refcount of 2, the refcount on t drops to 1 and then next assignment to t drops it to zero, releasing the memory block for re-use in the next trip through the main loop. So it's probably picking up the same memory block over and over again.

(If my memory is correct, Python's "over-allocate strings and check the refcount to allow expand-in-place" trick was not in all versions of Python. It may have first been added around Python 2.4 or so. This memory is extremely vague and a quick Google search did not turn up any evidence one way or the other. [Edit: Python 2.7.4, apparently.])

String concatenation much faster in Python than Go

Answers (2)

Related Questions