Reputation: 1838
I've been trying to use PyPy lately, and it's as much as 25x faster for my current project, which is working out pretty well. Unfortunately, however, writing files is incredibly slow: roughly 60 times slower than under regular Python.
I've been googling around a bit, but I haven't found anything helpful. Is this a known issue? Is there a workaround?
In a simple test case like this:
with file(path, 'w') as f:
    f.writelines(['testing to write a file\n' for i in range(5000000)])
I'm seeing a 60x slowdown in PyPy compared to regular Python. This is with 64-bit Python 2.7.3 and 32-bit PyPy 1.9 (implementing Python 2.7.2), both on the same OS and machine, of course (Windows 7).
Any help would be appreciated. PyPy is much faster for what I'm doing, but with file write speeds limited to half a megabyte per second, it's decidedly less useful.
Upvotes: 4
Views: 904
Reputation: 304393
It's slower, but not 60x slower on this system
TLDR: use a single write(...join(...)) call instead of writelines(...). (Note that since each line in your test already ends in '\n', joining on '\n' as in the last benchmark below doubles the line breaks; ''.join would reproduce the original output exactly.)
$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
10 loops, best of 3: 1.15 sec per loop
$ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in range(5000000)])"
10 loops, best of 3: 434 msec per loop
Using xrange makes no difference:
$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines(['testing to write a file\n' for i in xrange(5000000)])"
10 loops, best of 3: 1.15 sec per loop
Using a generator expression is slower for PyPy, but faster for Python:
$ pypy -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
10 loops, best of 3: 1.62 sec per loop
$ python -m timeit -s "path='tst'" "with file(path, 'w') as f:f.writelines('testing to write a file\n' for i in xrange(5000000))"
10 loops, best of 3: 407 msec per loop
Moving the creation of the data outside the benchmark amplifies the difference (~4.2x):
$ pypy -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
10 loops, best of 3: 786 msec per loop
$ python -m timeit -s "path='tst'; data=['testing to write a file\n' for i in range(5000000)]" "with file(path, 'w') as f:f.writelines(data)"
10 loops, best of 3: 189 msec per loop
Using write() instead of writelines() is much faster for both:
$ pypy -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
10 loops, best of 3: 51.9 msec per loop
$ python -m timeit -s "path='tst'; data='\n'.join('testing to write a file\n' for i in range(5000000))" "with file(path, 'w') as f:f.write(data)"
10 loops, best of 3: 52.4 msec per loop
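Spelled out as a plain script, the TLDR pattern looks like this. This is a sketch, not the benchmark command itself: the path and list size are illustrative, open is used instead of the Python-2-only file builtin so it also runs on Python 3, and ''.join is used because each line already ends in '\n', so the output matches writelines exactly.

```python
import os
import tempfile

lines = ['testing to write a file\n' for _ in range(1000)]
path = os.path.join(tempfile.gettempdir(), 'join_demo.txt')

# Build one big string and issue a single write() call,
# instead of one writelines() iteration per list element.
with open(path, 'w') as f:
    f.write(''.join(lines))
```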
$ uname -srvmpio
Linux 3.2.0-26-generic #41-Ubuntu SMP Thu Jun 14 17:49:24 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ python --version
Python 2.7.3
$ pypy --version
Python 2.7.2 (1.8+dfsg-2, Feb 19 2012, 19:18:08)
[PyPy 1.8.0 with GCC 4.6.2]
Upvotes: 2
Reputation: 35796
Let's first get your benchmarking method straight.
When the goal is to measure pure file-writing performance, it is a major flaw (a systematic error) to create the data to be written inside the code segment that you are timing, because data creation also takes time that you do not want to measure.
Hence, if you plan to keep the whole dummy data in memory, create it before measuring the time.
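One way to keep the data creation out of the timed section with the standard library is timeit's setup argument, which the shell benchmarks above use via -s. A minimal in-process sketch (file name and sizes here are illustrative, not from the question):

```python
import os
import tempfile
import timeit

# The dummy data is built in the setup string, so only the
# writelines() call itself is timed.
path = os.path.join(tempfile.gettempdir(), 'timeit_demo.txt')
setup = "data = ['testing to write a file\\n' for _ in range(10000)]"
stmt = "with open(%r, 'w') as f: f.writelines(data)" % path

elapsed = timeit.timeit(stmt, setup=setup, number=10)
print('10 runs took %.4f seconds' % elapsed)
```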
However, in your case, on-the-fly data generation is likely to be faster than your I/O will ever be. So by using a Python generator (here, a generator expression) in combination with the writelines call, you get rid of this systematic error.
I don't know how writelines performs compared to write. However, following your writelines example:
with file(path, 'w') as f:
    f.writelines('xxxx\n' for _ in xrange(10**6))
writing large chunks of data with write might be faster:
with file(path, 'w') as f:
    for chunk in ('x'*99999 for _ in xrange(10**3)):
        f.write(chunk)
Once you get the benchmarking right, I am pretty sure you will still find differences between Python and PyPy; maybe PyPy is even significantly slower under some circumstances. But with proper benchmarking, I believe you will manage to find the conditions under which PyPy's file writing is fast enough for your purposes.
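The chunked idea above can be made into a runnable comparison that also verifies both approaches produce identical files. The helper names, chunk size, and temp-file paths are all made up for illustration; open and range are used so the sketch runs on Python 3 as well as Python 2:

```python
import os
import tempfile

def write_lines(path, lines):
    # Many small writes: writelines() issues one write per element.
    with open(path, 'w') as f:
        f.writelines(lines)

def write_chunked(path, lines, chunk_size=1000):
    # Fewer, larger writes: join chunk_size lines per write() call.
    with open(path, 'w') as f:
        for start in range(0, len(lines), chunk_size):
            f.write(''.join(lines[start:start + chunk_size]))

lines = ['xxxx\n' for _ in range(5000)]
tmp = tempfile.gettempdir()
p1 = os.path.join(tmp, 'chunk_demo_a.txt')
p2 = os.path.join(tmp, 'chunk_demo_b.txt')
write_lines(p1, lines)
write_chunked(p2, lines)
```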
Upvotes: 0
Reputation: 14961
You're generating two lists here: one with range and one with the list comprehension.
List 1: one option is to replace the list-returning range with xrange, which produces its values lazily instead of building a list. Another is to try PyPy's own optimisation called range-lists, which you can enable with the --objspace-std-withrangelist option.
List 2: you're creating your output list before writing it. This should also be a generator, so turn the list comprehension into a generator expression:
f.writelines('testing to write a file\n' for i in range(5000000))
As long as a generator expression is the only argument passed to a function, it's not even necessary to double-up on the parentheses.
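For example (a toy three-line file under the system temp directory; open is used instead of the Python-2-only file builtin so the snippet also runs on Python 3):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'genexp_demo.txt')
with open(path, 'w') as f:
    # The generator expression is the sole argument, so it needs
    # no extra parentheses of its own.
    f.writelines('testing to write a file\n' for i in range(3))
```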
Upvotes: -1
Reputation: 6004
xrange is the answer for this example: unlike range, it doesn't build a list, it yields its values lazily. 64-bit Python is probably faster than 32-bit PyPy at generating a list with 5 million items.
If you have other code, post the actual code, not just a test.
Upvotes: 0