Reputation: 1838
I've been trying to use PyPy lately, and it's as much as 25x faster for my current project, which is working out pretty well. Unfortunately, however, writing files is incredibly slow: roughly 60 times slower than under regular Python.
I've been googling around a bit, but I haven't found anything helpful. Is this a known issue? Is there a workaround?
In a simple test case like this:
with file(path, 'w') as f:
    f.writelines(['testing to write a file\n' for i in range(5000000)])
I'm seeing a 60x slowdown in PyPy compared to regular Python. This is with 64-bit Python 2.7.3 and 32-bit PyPy 1.9 (implementing Python 2.7.2), both on the same OS and machine, of course (Windows 7).
Any help would be appreciated. PyPy is much faster for what I'm doing, but with file write speeds limited to half a megabyte per second, it's decidedly less useful.
Upvotes: 4
Views: 904
Reputation: 304393
It's slower, but not 60x slower on this system
TLDR: use a single write(...join(...)) call instead of writelines(...). (Note that since each line in your test already ends in '\n', joining on '\n' as in the last benchmark below doubles the line breaks; ''.join would reproduce the original output exactly.)
$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
10 loops, best of 3: 1.15 sec per loop
$ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
10 loops, best of 3: 434 msec per loop
Using xrange makes no difference:
$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in xrange(5000000)])"
10 loops, best of 3: 1.15 sec per loop
Using a generator expression is slower for PyPy, but faster for Python:
$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
10 loops, best of 3: 1.62 sec per loop
$ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
10 loops, best of 3: 407 msec per loop
Moving the creation of the data outside the benchmark amplifies the difference (~4.2x):
$ pypy -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
10 loops, best of 3: 786 msec per loop
$ python -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
10 loops, best of 3: 189 msec per loop
Using write() instead of writelines() is much faster for both:
$ pypy -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
10 loops, best of 3: 51.9 msec per loop
$ python -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
10 loops, best of 3: 52.4 msec per loop
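Spelled out as a plain script, the TLDR pattern looks like this. This is a sketch, not the benchmark command itself: the path and list size are illustrative, open is used instead of the Python-2-only file builtin so it also runs on Python 3, and ''.join is used because each line already ends in '\n', so the output matches writelines exactly.

```python
import os
import tempfile

lines = ['testing to write a file\n' for _ in range(1000)]
path = os.path.join(tempfile.gettempdir(), 'join_demo.txt')

# Build one big string and issue a single write() call,
# instead of one writelines() iteration per list element.
with open(path, 'w') as f:
    f.write(''.join(lines))
```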
$ uname -srvmpio
Linux 3.2.0-26-generic #41-Ubuntu SMP Thu Jun 14 17:49:24 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ python --version
Python 2.7.3
$ pypy --version
Python 2.7.2 (1.8+dfsg-2, Feb 19 2012, 19:18:08)
[PyPy 1.8.0 with GCC 4.6.2]
Upvotes: 2
Reputation: 35796
Let's first get your benchmarking method straight.
When the goal is to measure pure file-writing performance, it is a major flaw (a systematic error) to create the data to be written inside the code segment that you are timing, because data creation also takes time that you do not want to measure.
Hence, if you plan to keep the whole dummy data in memory, create it before measuring the time.
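One way to keep the data creation out of the timed section with the standard library is timeit's setup argument, which the shell benchmarks above use via -s. A minimal in-process sketch (file name and sizes here are illustrative, not from the question):

```python
import os
import tempfile
import timeit

# The dummy data is built in the setup string, so only the
# writelines() call itself is timed.
path = os.path.join(tempfile.gettempdir(), 'timeit_demo.txt')
setup = "data = ['testing to write a file\\n' for _ in range(10000)]"
stmt = "with open(%r, 'w') as f: f.writelines(data)" % path

elapsed = timeit.timeit(stmt, setup=setup, number=10)
print('10 runs took %.4f seconds' % elapsed)
```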
However, in your case, on-the-fly data generation is likely to be faster than your I/O will ever be. So by using a Python generator (here, a generator expression) in combination with the writelines call, you get rid of this systematic error.
I don't know how writelines performs compared to write. However, following your writelines example:
with file(path, 'w') as f:
    f.writelines('xxxx\n' for _ in xrange(10**6))
writing large chunks of data with write might be faster:
with file(path, 'w') as f:
    for chunk in ('x'*99999 for _ in xrange(10**3)):
        f.write(chunk)
Once you get the benchmarking right, I am pretty sure you will still find differences between Python and PyPy; maybe PyPy is even significantly slower under some circumstances. But with proper benchmarking, I believe you will manage to find the conditions under which PyPy's file writing is fast enough for your purposes.
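The chunked idea above can be made into a runnable comparison that also verifies both approaches produce identical files. The helper names, chunk size, and temp-file paths are all made up for illustration; open and range are used so the sketch runs on Python 3 as well as Python 2:

```python
import os
import tempfile

def write_lines(path, lines):
    # Many small writes: writelines() issues one write per element.
    with open(path, 'w') as f:
        f.writelines(lines)

def write_chunked(path, lines, chunk_size=1000):
    # Fewer, larger writes: join chunk_size lines per write() call.
    with open(path, 'w') as f:
        for start in range(0, len(lines), chunk_size):
            f.write(''.join(lines[start:start + chunk_size]))

lines = ['xxxx\n' for _ in range(5000)]
tmp = tempfile.gettempdir()
p1 = os.path.join(tmp, 'chunk_demo_a.txt')
p2 = os.path.join(tmp, 'chunk_demo_b.txt')
write_lines(p1, lines)
write_chunked(p2, lines)
```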
Upvotes: 0
Reputation: 14961
You're generating two lists here: one with range and one with the list comprehension.
List 1: one option is to replace the list-returning range with xrange, which produces its values lazily instead of building a list. Another is to try PyPy's own optimisation called range-lists, which you can enable with the --objspace-std-withrangelist option.
List 2: you're creating your output list before writing it. This should also be a generator, so turn the list comprehension into a generator expression:
f.writelines('testing to write a file\n' for i in range(5000000))
As long as a generator expression is the only argument passed to a function, it's not even necessary to double-up on the parentheses.
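For example (a toy three-line file under the system temp directory; open is used instead of the Python-2-only file builtin so the snippet also runs on Python 3):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'genexp_demo.txt')
with open(path, 'w') as f:
    # The generator expression is the sole argument, so it needs
    # no extra parentheses of its own.
    f.writelines('testing to write a file\n' for i in range(3))
```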
Upvotes: -1
Reputation: 6004
xrange is the answer for this example: unlike range, it doesn't build a list, it yields its values lazily. 64-bit Python is probably faster than 32-bit PyPy at generating a list with 5 million items.
If you have other code, post the actual code, not just a test.
Upvotes: 0