Rohin Kumar

Reputation: 810

Best way to store NumPy arrays in ASCII files

I often have processed NumPy arrays that result from lengthy computations, and I need to use them in other calculations. I currently pickle them and unpickle the files into variables as and when I need them.

I noticed that for large data sizes (~1M data points) this is slow. I have read elsewhere that pickling is not the best way to store huge files. I would like to store them as ASCII files and read them back efficiently, loading directly into a NumPy array. What is the best way to do this?

Say I have a 100k x 3 2D array in a variable 'a'. I want to store it in an ASCII file and then load it into a NumPy array variable 'b'.

Upvotes: 0

Views: 4900

Answers (3)

Pierre de Buyl

Reputation: 7293

The problem you pose is directly related to the size of the dataset.

Several specialized libraries address this fairly common problem.

  1. Python-only persistence: joblib offers an alternative to pickle aimed specifically at objects that are too large for convenient pickling.
  2. HDF5 is a file format designed specifically for storing arrays. The format is multi-language and multi-platform, and a very good Python library exists for it: h5py

An example with h5py. To write the data:

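import h5py
# 'a' is the NumPy array to store; it becomes a dataset named 'a' in the file
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('a', data=a)

To read the data:

import h5py
with h5py.File('data.h5', 'r') as f:
    # [:] copies the whole dataset into memory as a NumPy array
    b = f['a'][:]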

Upvotes: 2

Andrew Guy

Reputation: 9968

NumPy has a range of input and output methods that will do exactly what you are after.

One option would be numpy.save:

import numpy as np

my_array = np.array([1,2,3,4])
# np.save writes NumPy's binary .npy format (not text), hence the .npy extension
with open('data.npy', 'wb') as f:
    np.save(f, my_array, allow_pickle=False)

To load your data again:

with open('data.npy', 'rb') as f:
    my_loaded_array = np.load(f)
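
Since the question asks specifically about ASCII files, here is a minimal sketch of the text-based pair among NumPy's I/O routines, numpy.savetxt and numpy.loadtxt. The random data stands in for the 100k x 3 array 'a' from the question, and the temporary file path is a hypothetical placeholder, not something from the original post.

```python
import os
import tempfile

import numpy as np

# Stand-in for the 100k x 3 array 'a' from the question (random data, illustration only).
rng = np.random.default_rng(0)
a = rng.random((100_000, 3))

# Hypothetical file location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "a.txt")

# Write one row per line, columns separated by spaces.
# The default format '%.18e' writes each value with 19 significant digits.
np.savetxt(path, a)

# Read the text file back into a new array 'b'.
b = np.loadtxt(path)

print(b.shape, np.allclose(a, b))
```

This keeps the file human-readable, but every value has to be formatted on write and parsed on read, so expect it to be slower than the binary numpy.save route above.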

Upvotes: 3

Ignacio Vergara Kausel

Reputation: 6006

If you want efficiency, ASCII is not the way to go. The problem with pickle is that it depends on the Python version, so it is not a good idea for long-term storage. You can try other binary formats instead; the most straightforward solution is the numpy.save method, as described in the NumPy documentation.
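
To make the efficiency point concrete, the following rough sketch writes the same array once as text and once in the binary .npy format and compares the file sizes. The array contents and file names are assumptions chosen for illustration; actual sizes and timings will depend on your data and the text format you pick.

```python
import os
import tempfile

import numpy as np

# Illustrative data with the shape mentioned in the question.
a = np.random.default_rng(0).random((100_000, 3))

# Hypothetical file names inside a fresh temporary directory.
tmp = tempfile.mkdtemp()
txt_path = os.path.join(tmp, "a.txt")
npy_path = os.path.join(tmp, "a.npy")

np.savetxt(txt_path, a)                    # ASCII, default '%.18e' format
np.save(npy_path, a, allow_pickle=False)   # binary .npy format

txt_size = os.path.getsize(txt_path)
npy_size = os.path.getsize(npy_path)
print(f"text: {txt_size} bytes, binary: {npy_size} bytes")

# The binary file loads straight back into an identical array.
b = np.load(npy_path, allow_pickle=False)
print(np.array_equal(a, b))
```

With the default text format each float64 takes roughly three times the 8 bytes it needs in binary, which is one reason the binary route tends to be both smaller and faster.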

Upvotes: 3
