Yannick Borschneck

Reputation: 87

Trying to size down HDF5 File by changing index field types using h5py

I have a very large CSV File (~12Gb) that looks something like this:

posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0

I want to convert this CSV file into the HDF5 format using the h5py library, while also lowering the total file size by setting the field / index types, e.g.:

Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32 or something along those lines.

Note: I need to chunk the data in some form when reading it in, to avoid memory errors.

However, I am unable to get the desired result. What I have tried so far: using Pandas' own methods, following this guide: How to write a large csv file to hdf5 in python? This creates the file, but somehow I am unable to change the types, and the file remains too big (~10.7 GB). The field types are float64 and int64.
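For what it's worth, a minimal sketch of the pandas route with explicit dtypes (the column names come from the CSV header above; a small in-memory CSV stands in for the real 12 GB file, and the chunk size is only illustrative):

```python
import io
import numpy as np
import pandas as pd

# Small in-memory stand-in for the large file (same header as the question)
csv_text = (
    "posX,posY,posZ,eventID,parentID,clockTime\n"
    "-117.9853515625,60.2998046875,0.29499998688697815,0,0,0\n"
    "-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0\n"
)

# Explicit dtypes so pandas does not default to float64/int64
col_dtypes = {
    "posX": np.float32, "posY": np.float32, "posZ": np.float32,
    "eventID": np.int32, "parentID": np.int32, "clockTime": np.int32,
}

# chunksize bounds memory; on the real file use something like 1_000_000
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=col_dtypes, chunksize=1):
    chunks.append(chunk)

df = pd.concat(chunks)
print(df.dtypes)
```

Each chunk could then be appended to an HDF5 store with `chunk.to_hdf(path, key="data", format="table", append=True)` (this requires the PyTables package), which preserves the reduced dtypes.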

I also tried splitting the CSV into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing said lines with sed. Then I tried the following code:

import pandas as pd
import h5py

PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa" #xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)

with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")

Sadly, this created the file and the dataset but didn't write any data into it.

Expectation: creating an HDF5 file containing the data of a large CSV file, while also changing the variable type of each index.

If something is unclear, please ask me for clarification. I'm still a beginner!

Upvotes: 1

Views: 406

Answers (1)

kcw78

Reputation: 8006

Have you considered the numpy module? It has a handy function (genfromtxt) to read CSV data with headers into a NumPy array. You define the dtype, and the resulting array is suitable for loading into HDF5 with the create_dataset() function of h5py.

See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.

import h5py
import numpy as np

PATH_csv = 'SO_55576601.csv'
csv_dtype = ('f4', 'f4', 'f4', 'i4', 'i4', 'i4')  # float32 / int32, as requested

csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)

print(csv_data.dtype.names)
print(csv_data['posX'])

with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)

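One caveat: genfromtxt reads the entire file at once, which may not fit in memory for a ~12 GB CSV. A possible workaround (my own sketch, not tested on the full file; file names and the chunk size are hypothetical) is to feed genfromtxt batches of lines and append them to a resizable h5py dataset:

```python
import h5py
import numpy as np
from itertools import islice

csv_path, h5_path = "pct_demo.csv", "pct_demo.h5"

# Structured dtype: float32 positions, int32 IDs, as requested
csv_dtype = np.dtype([("posX", "f4"), ("posY", "f4"), ("posZ", "f4"),
                      ("eventID", "i4"), ("parentID", "i4"), ("clockTime", "i4")])

# Small stand-in CSV so the sketch is self-contained
with open(csv_path, "w") as fh:
    fh.write("posX,posY,posZ,eventID,parentID,clockTime\n")
    for i in range(5):
        fh.write(f"-117.98,60.29,{0.29 + i},0,0,{i}\n")

chunk_rows = 2  # use something like 1_000_000 on the real file
with open(csv_path) as fh, h5py.File(h5_path, "w") as h5f:
    fh.readline()  # skip the header row
    # Resizable dataset: start empty, grow along axis 0 as chunks arrive
    dset = h5f.create_dataset("CSV_data", shape=(0,), maxshape=(None,),
                              dtype=csv_dtype, chunks=True)
    while True:
        lines = list(islice(fh, chunk_rows))
        if not lines:
            break
        # genfromtxt accepts an iterable of lines; atleast_1d handles 1-row chunks
        arr = np.atleast_1d(np.genfromtxt(lines, dtype=csv_dtype, delimiter=","))
        dset.resize(dset.shape[0] + arr.shape[0], axis=0)
        dset[-arr.shape[0]:] = arr

with h5py.File(h5_path) as h5f:
    total_rows = h5f["CSV_data"].shape[0]
print(total_rows)
```

Only one chunk of rows is in memory at a time, and the stored dataset keeps the float32/int32 field types.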

Upvotes: 1
