Reputation: 103
I have a large dataframe that I need to save to disk. The columns have dtypes like numpy.int32 or numpy.floatxx:
         int32Data  ColumnName  ...   float32Data  otherTypeData
0        150294240      4260.0  ...  3.203908e+02         7960.0
1        150294246      4260.0  ...  0.000000e+00         7960.0
2        150294252      4280.0  ...  1.117543e+03         7960.0
3        150294258      4260.0  ...  5.117185e+01         7960.0
4        150294264      4260.0  ...  5.999993e+02         7960.0
...            ...         ...  ...           ...            ...
1839311  161375508     54592.0  ...  8.990022e+05            0.0
1839312  161375514     54624.0  ...  2.097199e+06            0.0
1839313  161375520     54656.0  ...  1.192150e+06            0.0
1839314  161375526     54688.0  ...  1.249997e+06            0.0
1839315  161375532     54592.0  ...  8.949273e+05            0.0
Using the correct datatypes saves a lot of disk space and processing power.
But when I save the dataframe df to disk

import numpy as np

np.save(FilePath, df)

and reread it

from pandas import DataFrame

ReadData = np.load(FilePath).tolist()
df = DataFrame(ReadData)
then all of the data has been converted to numpy.float64 (and the column names are erased).
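For example, inspecting the reloaded frame confirms the loss (a minimal check; df here is the reloaded dataframe from above):

print(df.dtypes)   # every column now reports float64
print(df.columns)  # a plain RangeIndex instead of the original column names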
Is it possible to save and load the dataframe while preserving the datatype of each column (and the column names)?
Upvotes: 0
Views: 32
Reputation: 365
HDF5 storage may be exactly what you are looking for: it lets you store large amounts of data efficiently, preserves the data types (and column names), and retrieves the data very quickly. You can find more details in the pandas documentation.
An example of how to use it:
import pandas as pd

with pd.HDFStore(file_path) as hdf:
    # save the dataframe to the HDF5 store under the given key
    hdf.put(key, df)
    # and retrieve it later
    df = hdf.get(key)
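If you prefer not to manage the store object yourself, the same round trip can be written with pandas' to_hdf/read_hdf convenience wrappers (a minimal sketch; 'data.h5' and the key 'df' are placeholder names, and both approaches require the PyTables package to be installed):

import pandas as pd

df.to_hdf('data.h5', key='df')     # writes the frame, keeping per-column dtypes
df = pd.read_hdf('data.h5', 'df')  # reads it back with dtypes and column names intact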
Upvotes: 1