Gulzar

Reputation: 28044

Python hangs silently on large file write

I am trying to write a big list of numpy ndarrays to disk.

The list is ~50000 elements long

Each element is an ndarray of shape (~2048, 2) containing ints. The arrays have different shapes.

The method I am (currently) using is

@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)

I have also tried pickle, which gives the same problem:

On small lists (~3400 elements), this works fine and finishes fast enough (<30 sec).

On lists ~6000 elements long, this finishes after ~2 minutes.

When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.

I stopped waiting after 30 minutes.

After force-stopping the process, the file suddenly grew to a significant size (~600MB). I can't tell whether it finished writing or not.

What is the correct way to write such large lists, know whether the write succeeded, and, if possible, know when the write/read is going to finish?

How can I debug what's happening when the process seems to hang?

I prefer not to break up and reassemble the lists manually in my code; I expect the serialization libraries to be able to do that for me.

Upvotes: 0

Views: 2051

Answers (2)

vladmihaisima

Reputation: 2248

For the code

import numpy as np
import yaml

# Generate 50000 arrays of shape (2048, 2), then dump them all into a single YAML file.
x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))
print("Arrays generated")
with open("t.yaml", 'w+', encoding='utf8') as outfile:
    yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)

on my system (MacOSX, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13 the finish time is 61 minutes. During the save, the Python process occupied around 5 GBytes of memory and the final file size is 2 GBytes. This also shows the overhead of the file format: since the raw data size is 50k * 2048 * 2 * 8 bytes (a float is generally 64 bits in Python) = 1562 MBytes, YAML produces a file around 1.3 times larger (and serialisation/deserialisation also takes time).
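
As a quick sanity check (not part of the original benchmark), the raw-size arithmetic above can be reproduced directly:

# 50k arrays of shape (2048, 2) of 64-bit floats
n_arrays = 50_000
raw_bytes = n_arrays * 2048 * 2 * 8      # 8 bytes per float
print(raw_bytes / 2**20)                 # ~1562.5 MBytes of raw data
print((2 * 2**30) / raw_bytes)           # ~1.3, ratio of the 2 GByte YAML file to the raw data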

To answer your questions:

  1. There is no correct or incorrect way. Having a progress update and an estimate of the finishing time is not easy (e.g. other tasks might interfere with the estimation, resources like memory could be used up, etc.). You can rely on a library that supports that or implement something yourself, as the other answer suggested (a minimal sketch follows this list).
  2. Not sure "debug" is the correct term, as in practice it might be that the process is just slow. Doing a performance analysis is not easy, especially when using multiple/different libraries. What I would start with is clear requirements: what do you want from the saved file? Does it need to be YAML? Saving 50k arrays as YAML does not seem the best solution if you care about performance. You should first ask yourself "which is the best format for what I want?" (but you did not give details, so I can't say...)
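
As a minimal sketch of the "implement something yourself" option from point 1 (the helper name and reporting interval are made up for illustration, not from the question): each element is pickled separately into the same file, so progress and a rough ETA can be printed along the way.

import pickle
import time

def dump_with_progress(obj_list, path, report_every=1000):
    # Hypothetical helper: writes one pickle per element into a single file
    # and prints progress plus a rough time estimate.
    start = time.time()
    with open(path, 'wb') as f:
        pickler = pickle.Pickler(f, protocol=pickle.HIGHEST_PROTOCOL)
        for i, item in enumerate(obj_list, 1):
            pickler.dump(item)
            if i % report_every == 0:
                elapsed = time.time() - start
                eta = elapsed / i * (len(obj_list) - i)
                print(f"{i}/{len(obj_list)} written, ~{eta:.0f}s left")
    # To read back: open the file in 'rb' and call pickle.load(f) repeatedly until EOFError.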

Edit: if you want something just fast, use pickle. The code:

import numpy as np
import pickle

x = []
for i in range(0, 50000):
    x.append(np.random.rand(2048, 2))
print("Arrays generated")
# Note: the file is still named "t.yaml" here, but its content is pickle, not YAML.
with open("t.yaml", "wb") as outfile:
    pickle.dump(x, outfile)

finishes in 9 seconds and generates a file of 1.5 GBytes (no overhead). Of course, the pickle format should be used in very different circumstances than YAML...
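
For completeness, loading the pickled list back is the mirror operation (a small sketch, not part of the original answer):

import pickle

with open("t.yaml", "rb") as infile:   # same (misleadingly named) file as above
    x = pickle.load(infile)
print(len(x), x[0].shape)              # 50000 (2048, 2)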

Upvotes: 1

Martin

Reputation: 3385

I can't say this is the answer, but it may be.

When I was working on an app that required fast cycles, I found out that something in the code was very slow: opening/closing YAML files.

It was solved by using JSON.

Don't use YAML for anything other than some kind of config that you don't open often.

Solution for saving your arrays:

np.save(path, array)  # e.g. path = directory + name + '.npy'

If you really need to save a list of arrays, I recommend saving a list of the array paths (the arrays themselves you save to disk with np.save). Saving Python objects to disk is not really what you want; what you want is to save numpy arrays with np.save.

Complete solution (saving example):

for array_index in range(len(list_of_arrays)):
    np.save(str(array_index) + '.npy', list_of_arrays[array_index])
    # path = str(array_index) + '.npy'

Complete solution (loading example):

list_of_array_paths = ['1.npy','2.npy']
list_of_arrays = []
for array_path in list_of_array_paths:
    list_of_arrays.append(np.load(array_path))

Further advice:

Python can't really handle large arrays well, especially if you have several of them loaded in a list at once. For both speed and memory, always work with one or two arrays at a time; the rest should wait on disk. So instead of keeping an object reference, keep the path as the reference and load the array from disk when needed (see the sketch below).
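
One way to follow this advice without changing the rest of your code too much is a small wrapper that stores only paths and loads each array on access; this is an illustrative sketch (the class name and paths are made up), not something from the answer:

import numpy as np

class LazyArrayList:
    # Keeps only the file paths in memory; each array is read from disk
    # on access and can be garbage-collected afterwards.
    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        return np.load(self.paths[index])

arrays = LazyArrayList(['1.npy', '2.npy'])
first = arrays[0]   # only this array is loaded into memory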

Also, you said you don't want to assemble the list manually.

A possible solution, which I don't advise, but which may be exactly what you are looking for:

>>> a = np.zeros(shape = [10,5,3])
>>> b = np.zeros(shape = [7,7,9])
>>> c = [a,b]
>>> np.save('data.npy',c)
>>> d = np.load('data.npy')
>>> d.shape
(2,)
>>> type(d)
<type 'numpy.ndarray'>
>>> d.shape
(2,)
>>> d[0].shape
(10, 5, 3)
>>> 

I believe the code above needs no further comment. However, note that after loading it back you lose the list: it is turned into a single numpy array (of dtype object, since the shapes differ).
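
Note that the transcript above is from an older Python/NumPy. On newer NumPy versions the same idea should still work, but (as far as I know) the ragged list must be wrapped in an object array explicitly, and loading requires opting in to pickling:

import numpy as np

a = np.zeros(shape=[10, 5, 3])
b = np.zeros(shape=[7, 7, 9])
# Newer NumPy refuses to build a ragged array implicitly, so be explicit:
c = np.array([a, b], dtype=object)
np.save('data.npy', c)
# Object arrays are stored via pickle, so loading needs allow_pickle=True:
d = np.load('data.npy', allow_pickle=True)
print(d.shape, d[0].shape)   # (2,) (10, 5, 3)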

Upvotes: 1
