Sheldon

Reputation: 193

Speed of writing a numpy array to a text file

I need to write a very "tall" two-column array to a text file, and it is very slow. I find that if I reshape the array into a wider one, the writing speed is much faster. For example:

import time
import numpy as np
dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

With the same number of elements in the three data matrices, why is the last one so much more time-consuming than the other two? Is there any way to speed up the writing of a "tall" data array?

Upvotes: 8

Views: 4440

Answers (2)

unutbu

Reputation: 879103

As hpaulj pointed out, savetxt is looping through the rows of X and formatting each row individually:

for row in X:
    try:
        v = format % tuple(row) + newline
    except TypeError:
        raise TypeError("Mismatch between array dtype ('%s') and "
                        "format specifier ('%s')"
                        % (str(X.dtype), format))
    fh.write(v)
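(In this loop, format is the joined per-row format string, e.g. '%g %g' for a two-column array, so each row costs one string interpolation and one file write.)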

I think the main time-killer here is all the string interpolation calls. If we pack all the string interpolation into one call, things go much faster:

with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)
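One caveat: this builds the entire output as a single Python string, so for a very tall array the peak memory use can be large. A middle ground is to interpolate a chunk of rows at a time. Here is a minimal sketch of that idea (savetxt_chunked and the chunk size are my own hypothetical choices, not part of numpy):

import numpy as np

def savetxt_chunked(f, arr, fmt='%g', chunk=10000):
    # Hypothetical helper: format `chunk` rows with a single
    # %-interpolation each, trading peak memory for a few extra writes.
    row_fmt = ' '.join([fmt] * arr.shape[1])
    for start in range(0, arr.shape[0], chunk):
        block = arr[start:start + chunk]
        block_fmt = '\n'.join([row_fmt] * block.shape[0])
        f.write(block_fmt % tuple(block.ravel()))
        f.write('\n')

with open('/tmp/test5.txt', 'w') as f:
    savetxt_chunked(f, np.random.rand(500000, 2))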

Here is the full timing script:

import time
import numpy as np

dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('/tmp/test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())        
    f.write(data)
end = time.perf_counter()
print(end-start)

This reports:

0.1604848340011813
0.17416274400056864
0.6634929459996783
0.16207673999997496

Upvotes: 11

hpaulj

Reputation: 231335

The code for savetxt is plain Python and easy to inspect. Basically, it does a formatted write for each row/line. In effect it does

for row in arr:
    f.write(fmt % tuple(row))

where fmt is derived from your fmt and shape of the array, e.g.

'%g %g %g ...'

So it does one file write per row of the array. Formatting each line takes some time as well, but that work happens in memory, in Python code.
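To see how much of the total the formatting accounts for, here is a sketch (my own illustration, not from the numpy source) that times the per-row interpolation with the file I/O removed entirely:

import time
import numpy as np

arr = np.random.rand(500000, 2)
fmt = '%g %g'   # the derived per-row format for two columns

start = time.perf_counter()
lines = [fmt % tuple(row) for row in arr]   # formatting only, no writes
print('formatting only:', time.perf_counter() - start)

If this accounts for most of the tall-array time, the per-row interpolation, not the write calls themselves, is the bottleneck.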

I expect loadtxt/genfromtxt to show the same time pattern - reading many short rows takes longer than reading a few long ones.

pandas has a faster csv load. I haven't seen any discussion of its write speed.
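For completeness, a minimal sketch of the write side using pandas.DataFrame.to_csv (the file name is arbitrary, and whether this actually beats savetxt here is untested):

import numpy as np
import pandas as pd

dataMat3 = np.random.rand(500000, 2)
# float_format plays the role of savetxt's fmt; sep is the delimiter
pd.DataFrame(dataMat3).to_csv('/tmp/test_pandas.txt', sep=' ',
                              header=False, index=False, float_format='%g')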

Upvotes: 4
