Reputation: 17631
I am parsing data from a large, 800 GB csv file. For each line of data, I build a one-row pandas dataframe:
import csv
import pandas as pd

readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of csv field:value pairs, "dictionary_line"
    # save as a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
Now, I would like to save this into HDF5 format, and query the h5 file as if it were the entire csv file.
My approach so far has been:
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of csv field:value pairs, "dictionary_line"
    # save as a one-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
That is, I try to append each dataframe df to the HDF5 store under one key. However, this fails:
Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'
So, I could try to save everything into one pandas dataframe first, i.e.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of csv field:value pairs, "dictionary_line"
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])  # accumulates one big dataframe
# and now store into HDF5 format
store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)
However, I don't think I have the RAM to hold all csv lines in total_df before writing it out in HDF5 format.
So, how do I append each "single-line" df into an HDF5 so that it ends up as one big dataframe (like the original csv)?
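One alternative I have considered is letting pandas read the csv in chunks and appending each chunk, instead of building one-row frames myself. A sketch, where "data.csv", "file.h5", the key, and the chunk size are all placeholders (a tiny generated csv stands in for the real 800 GB file):

```python
import pandas as pd

# Tiny stand-in for the 800 GB file: any csv with the same layout works here.
csv_path = "data.csv"
pd.DataFrame({"COL1": range(10), "COL2": range(10)}).to_csv(csv_path, index=False)

# Let pandas do the parsing in chunks; each chunk is appended to the same
# key, so the h5 file ends up holding one big table.
store = pd.HDFStore("file.h5", mode="w")
for chunk in pd.read_csv(csv_path, chunksize=4):  # chunksize is a placeholder
    store.append("single_key", chunk, data_columns=True, index=False)
store.close()
```

This assumes the per-line parsing can be expressed through read_csv's options; if each line needs custom logic, the loop above would not apply as-is.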
EDIT: Here's a concrete example of a csv file with different data types:
order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....
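With mixed dtypes like the rows above, one thing worth noting is that the first append fixes the width of the string column; reserving width with min_itemsize lets later, longer category strings still fit. A minimal sketch (the path, key, and width of 20 are assumptions, and the two hard-coded rows stand in for the parsed csv lines):

```python
import pandas as pd

# Two rows from the example data, standing in for "dictionary_line".
rows = [
    {"order": 1, "start": 1342, "end": 1357, "value": "category1"},
    {"order": 15, "start": 792, "end": 813, "value": "category13"},
]

store = pd.HDFStore("orders.h5", mode="w")
for i, row in enumerate(rows):
    df = pd.DataFrame(row, index=[i])
    # min_itemsize reserves space for the string column; without it, the
    # second append would fail because "category13" is longer than "category1".
    store.append("single_key", df, data_columns=True,
                 index=False, min_itemsize={"value": 20})
store.close()
```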
Upvotes: 1
Views: 2383
Reputation: 97291
Your code should work. Can you try the following code:
import pandas as pd
import numpy as np

store = pd.HDFStore("file.h5", "w")
hdf5_key = "single_key"
csv_columns = ["COL%d" % i for i in range(1, 56)]

for i in range(10):
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
store.close()
If this code works, then something is wrong with your data.
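If the appends succeed, you can also check that the table is queryable by column, which is what data_columns buys you. A small self-contained sketch (file name, key, and the COL1 > 0 condition are just examples):

```python
import numpy as np
import pandas as pd

# Write a small table, then filter on a data_column without
# loading the whole table into memory.
store = pd.HDFStore("file.h5", "w")
df = pd.DataFrame(np.random.randn(100, 3), columns=["COL1", "COL2", "COL3"])
store.append("single_key", df, data_columns=["COL1", "COL2", "COL3"], index=False)
subset = store.select("single_key", where="COL1 > 0")  # only matching rows
store.close()
```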
Upvotes: 1