Reputation: 193
I need to write a very "tall" two-column array to a text file and it is very slow. I find that if I reshape the array into a wider one, writing is much faster. For example:
import time
import numpy as np

dataMat1 = np.random.rand(1000, 1000)
dataMat2 = np.random.rand(2, 500000)
dataMat3 = np.random.rand(500000, 2)

start = time.perf_counter()
with open('test1.txt', 'w') as f:
    np.savetxt(f, dataMat1, fmt='%g', delimiter=' ')
end = time.perf_counter()
print(end - start)

start = time.perf_counter()
with open('test2.txt', 'w') as f:
    np.savetxt(f, dataMat2, fmt='%g', delimiter=' ')
end = time.perf_counter()
print(end - start)

start = time.perf_counter()
with open('test3.txt', 'w') as f:
    np.savetxt(f, dataMat3, fmt='%g', delimiter=' ')
end = time.perf_counter()
print(end - start)
With the same number of elements in all three matrices, why does the last one take so much longer to write than the other two? Is there any way to speed up writing a "tall" data array?
Upvotes: 8
Views: 4440
Reputation: 879103
As hpaulj pointed out, savetxt is looping through the rows of X and formatting each row individually:
for row in X:
    try:
        v = format % tuple(row) + newline
    except TypeError:
        raise TypeError("Mismatch between array dtype ('%s') and "
                        "format specifier ('%s')"
                        % (str(X.dtype), format))
    fh.write(v)
I think the main time-killer here is all the string interpolation calls. If we pack all the string interpolation into one call, things go much faster:
with open('/tmp/test4.txt', 'w') as f:
    # Build one format string covering every row, then do a single
    # interpolation and a single write.
    fmt = ' '.join(['%g'] * dataMat3.shape[1])
    fmt = '\n'.join([fmt] * dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)
The full timing script:

import io
import time
import numpy as np

dataMat1 = np.random.rand(1000, 1000)
dataMat2 = np.random.rand(2, 500000)
dataMat3 = np.random.rand(500000, 2)

start = time.perf_counter()
with open('/tmp/test1.txt', 'w') as f:
    np.savetxt(f, dataMat1, fmt='%g', delimiter=' ')
end = time.perf_counter()
print(end - start)

start = time.perf_counter()
with open('/tmp/test2.txt', 'w') as f:
    np.savetxt(f, dataMat2, fmt='%g', delimiter=' ')
end = time.perf_counter()
print(end - start)

start = time.perf_counter()
with open('/tmp/test3.txt', 'w') as f:
    np.savetxt(f, dataMat3, fmt='%g', delimiter=' ')
end = time.perf_counter()
print(end - start)

start = time.perf_counter()
with open('/tmp/test4.txt', 'w') as f:
    fmt = ' '.join(['%g'] * dataMat3.shape[1])
    fmt = '\n'.join([fmt] * dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)
end = time.perf_counter()
print(end - start)
This reports:
0.1604848340011813
0.17416274400056864
0.6634929459996783
0.16207673999997496
Upvotes: 11
Reputation: 231335
The code for savetxt is Python and accessible. Basically it does a formatted write for each row/line. In effect it does:
for row in arr:
    f.write(fmt % tuple(row))
where fmt is derived from your fmt and the shape of the array, e.g. '%g %g %g ...'.
So it's doing a file write for each row of the array. Formatting each line takes some time as well, but that is done in memory with Python code.
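For a rough sense of how those two costs split, one could time the per-row formatting against an in-memory buffer and against a real file. A minimal sketch (my own illustration, with an arbitrary /tmp path):

import io
import time
import numpy as np

arr = np.random.rand(500000, 2)
fmt = '%g %g'

# Format every row, writing only to an in-memory buffer.
start = time.perf_counter()
buf = io.StringIO()
for row in arr:
    buf.write(fmt % tuple(row) + '\n')
print('format + StringIO:', time.perf_counter() - start)

# Same formatting, but with a (buffered) file write per row.
start = time.perf_counter()
with open('/tmp/test_rowwise.txt', 'w') as f:
    for row in arr:
        f.write(fmt % tuple(row) + '\n')
print('format + file writes:', time.perf_counter() - start)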
I expect loadtxt/genfromtxt will show the same time pattern: it takes longer to read many rows.
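One way to check would be to time np.loadtxt on the wide and tall files written above (assuming they are still on disk):

import time
import numpy as np

# Compare reading the wide file (2 x 500000) with the tall file (500000 x 2).
for path in ('/tmp/test2.txt', '/tmp/test3.txt'):
    start = time.perf_counter()
    arr = np.loadtxt(path)
    print(path, arr.shape, time.perf_counter() - start)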
pandas has a faster csv load. I haven't seen any discussion of its write speed.
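If pandas is an option, writing the same tall array with DataFrame.to_csv would look roughly like the sketch below; whether it actually beats the single-interpolation trick above is something you'd have to time:

import numpy as np
import pandas as pd

dataMat3 = np.random.rand(500000, 2)

# Wrap the array in a DataFrame and write it space-delimited, with no
# header or index, using the same '%g' float formatting as savetxt.
pd.DataFrame(dataMat3).to_csv('/tmp/test_pandas.txt', sep=' ',
                              header=False, index=False, float_format='%g')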
Upvotes: 4