Python hangs silently on large file write

Question

I am trying to write a big list of numpy nd_arrays to disk.

The list is ~50000 elements long

Each element is a nd_array of size (~2048,2) of ints. The arrays have different shapes.

The method I am (curently) using is

@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)

I have also tried pickle which also give the same problem:

On small lists (~3400 long), this works fine, finishes fast enough (<30 sec).

On ~6000 long lists, this finishes after ~2 minutes.

When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.

I stopped waiting after 30 minutes.

After force stopping the process, the file suddenly became of significant size (~600MB). I can't know if it finished writing or not.

What is the correct way to write such large lists, know if he write succeeded, and, if possible, knowing when the write/read is going to finish?

How can I debug what's happening when the process seems to hang?

I prefer not to break and assemble the lists manually in my code, I expect the serialization libraries to be able to do that for me.

vladmihaisima · Accepted Answer

For the code

import numpy as np
import yaml

x = []
for i in range(0,50000):
    x.append(np.random.rand(2048,2))
print("Arrays generated")
with open("t.yaml", 'w+', encoding='utf8') as outfile:
    yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)

on my system (MacOSX, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13 the finish time is 61min. During the save the python process occupied around 5 GBytes of memory and final file size is 2 GBytes. This also shows the overhead of the file format: as the size of the data is 50k * 2048 * 2 * 8 (the size of a float is generally 64 bits in python) = 1562 MBytes, means yaml is around 1.3 times worse (and serialisation/deserialisation is also taking time).

To answer your questions:

There is no correct or incorrect way. To have a progress update and estimation of finishing time is not easy (ex: other tasks might interfere with the estimation, resources like memory could be used up, etc.). You can rely on a library that supports that or implement something yourself (as the other answer suggested)
Not sure "debug" is the correct term, as in practice it might be that the process just slow. Doing a performance analysis is not easy, especially if using multiple/different libraries. What I would start with is clear requirements: what do you want from the file saved? Do they need to be yaml? Saving 50k arrays as yaml does not seem the best solution if you care about performance. Should you ask yourself first "which is the best format for what I want?" (but you did not give details so can't say...)

Edit: if you want something just fast, use pickle. The code:

import numpy as np
import yaml
import pickle

x = []
for i in range(0,50000):
    x.append(np.random.rand(2048,2))
print("Arrays generated")
pickle.dump( x, open( "t.yaml", "wb" ) )

finishes in 9 seconds, and generates a file of 1.5GBytes (no overhead). Of course pickle format should be used in very different circumstances than yaml...

Python hangs silently on large file write

Answers (2)

Related Questions