Reputation: 103
I have a large dataframe that I need to save to disk. The columns have dtypes like numpy.int32 or numpy.floatxx:
         int32Data  ColumnName  ...   float32Data  otherTypeData
0        150294240      4260.0  ...  3.203908e+02         7960.0
1        150294246      4260.0  ...  0.000000e+00         7960.0
2        150294252      4280.0  ...  1.117543e+03         7960.0
3        150294258      4260.0  ...  5.117185e+01         7960.0
4        150294264      4260.0  ...  5.999993e+02         7960.0
...            ...         ...  ...           ...            ...
1839311  161375508     54592.0  ...  8.990022e+05            0.0
1839312  161375514     54624.0  ...  2.097199e+06            0.0
1839313  161375520     54656.0  ...  1.192150e+06            0.0
1839314  161375526     54688.0  ...  1.249997e+06            0.0
1839315  161375532     54592.0  ...  8.949273e+05            0.0
Using the correct datatypes saves a lot of disk space and processing power.
But when I save the dataframe df to disk

import numpy as np

np.save(FilePath, df)

and reread it

from pandas import DataFrame

ReadData = np.load(FilePath).tolist()
df = DataFrame(ReadData)
then all of the data has been converted to numpy.float64 (and the column names are erased).
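For example, inspecting the reloaded frame confirms the loss (a minimal check; df here is the reloaded dataframe from above):

print(df.dtypes)   # every column now reports float64
print(df.columns)  # a plain RangeIndex instead of the original column names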
Is it possible to save and load the dataframe while preserving the datatype of each column (and the column names)?
Upvotes: 0
Views: 32
Reputation: 365
HDF5 storage may be exactly what you are looking for: it lets you store large amounts of data efficiently, preserves the data types (and column names), and retrieves the data very quickly. You can find more details in the pandas documentation.
An example of how to use it:
import pandas as pd

with pd.HDFStore(file_path) as hdf:
    # save the dataframe to the HDF5 store under the given key
    hdf.put(key, df)
    # and retrieve it later
    df = hdf.get(key)
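If you prefer not to manage the store object yourself, the same round trip can be written with pandas' to_hdf/read_hdf convenience wrappers (a minimal sketch; 'data.h5' and the key 'df' are placeholder names, and both approaches require the PyTables package to be installed):

import pandas as pd

df.to_hdf('data.h5', key='df')     # writes the frame, keeping per-column dtypes
df = pd.read_hdf('data.h5', 'df')  # reads it back with dtypes and column names intact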
Upvotes: 1