eraoul

Reputation: 1151

pandas.HDFStore: How do I modify "data_columns" for an existing store? I'd like to add an index to a column not in data columns

I have created a large (120 GB; 1 billion rows) HDF5 file using pandas. After initially creating the HDF file, I appended to it like so:

with pd.get_store(path_output) as hdf_output:
  for i in range(BIG_LOOP):
    df = ...
    hdf_output.append('all', df, data_columns=[])

I purposefully set data_columns=[] in order to avoid indexing during creation time. Now that I have the HDF file, I'd like to add indexes to several columns (say, columns_to_index=['A', 'B', 'C']).

Somehow, according to ptdump, I do have data_columns:=['A'] at the moment, but I don't recall how that happened. (Perhaps the initial df was written with a different parameter; I was successively adding to the hdfstore over several days and may have changed something.) In any case, regardless of how this was created, I'd like to index additional columns.

Simply calling mystore.create_table_index('all', columns=['A', 'B', 'C'], optlevel=9, kind='full') doesn't work, apparently. The first time I ran it, it churned for an hour and added 2 GB to the file size (inspecting the metadata shows that the chunksize was increased), but I don't have all 3 indexes (just an index for 'A'). How can I generate the index for all 3 columns?

I also noticed this line in the ptdump -- it seems disturbing that I have "non_index_axes" for the items I'd like to index: non_index_axes := [(1, ['A', 'B', 'C'])]

If it isn't possible to create the index in pandas, I'd appreciate advice on how to do this directly in pytables. (E.g., do I need to first delete any existing indices? And how do I modify the "non_index_axes" and the "data_columns"?)
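
For what it's worth, I can poke at the table directly with PyTables along these lines (just a sketch; the node path assumes pandas' default layout for the 'all' key):

import tables

with tables.open_file(path_output, mode='r') as h5:
    table = h5.root.all.table  # the PyTables Table that pandas created under the 'all' key
    for name in table.colnames:  # lists the index, the packed values blocks, and any data columns
        print(name, table.cols._f_col(name).is_indexed)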

Edit: Anticipating questions about my use case, here's the big picture of what I'm trying to accomplish:

  1. Read in 120 GB of data from CSV files. Each file represents one day of financial data and consists of hundreds of thousands of rows, with around a dozen columns per row. I'm just storing every row, sequentially, in the HDF5 file. I'd like this initial phase to run quickly, which is why I turned indexing off. Currently I read and parse each CSV file in 6 seconds, and storing it into the HDF5 file as above takes just 1.5 seconds.

  2. Index a handful (not all) of the columns to support a variety of queries, such as getting all items with a given string in column 1 and with a date from column 2 in a certain range (see the example query sketched after this list).

  3. As time passes, each day I will parse a new CSV file and add it to the HDF5 file. I expect the indices to continue to be updated.

  4. (Depending on my access patterns, the order I store rows in (currently, by date) might continue to be the best order for retrieval. I might also end up needing to sort by a different column in most queries, in which case I think I would need to re-sort the table after each CSV file is parsed and appended.)
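
For step 2, the kind of query I'd like to support looks roughly like this (just a sketch; 'A', 'B', and the values are placeholders for my real column names and dates):

import pandas as pd

with pd.HDFStore(path_output) as store:
    # this only works for columns that were declared as data_columns (and ideally indexed)
    df = store.select('all', where="A == 'some_string' & B >= '20160101' & B < '20160201'")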

Currently I'm stuck on step 2, generating the column indices.

Upvotes: 3

Views: 1698

Answers (1)

MaxU - stand with Ukraine

Reputation: 210842

I'd do it a bit differently - take a look at this small example:

for chunk in ...:  # read the data in chunks
    # specify `data_columns`, but don't build the index yet (`index=False`)
    hdf_output.append('all', chunk, data_columns=cols_to_index, index=False)

# index the chosen columns explicitly, once all the data has been appended
hdf_output.create_table_index('all', columns=cols_to_index, optlevel=9, kind='full')
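
Applied to your use case (one CSV file per day, indexing only a few columns), the whole pipeline could look roughly like this -- just a sketch, where path_output, the glob pattern, the chunk size and cols_to_index are placeholders for your actual values:

import glob
import pandas as pd

path_output = 'all_data.h5'            # hypothetical output path
cols_to_index = ['A', 'B', 'C']        # the columns you want to be able to query on

with pd.HDFStore(path_output) as hdf_output:
    for csv_file in sorted(glob.glob('data/*.csv')):  # one file per day
        for chunk in pd.read_csv(csv_file, chunksize=10**6):
            # declare the queryable columns up front, but postpone index creation
            hdf_output.append('all', chunk, data_columns=cols_to_index, index=False)

    # build the indexes once, after all the appends are done
    hdf_output.create_table_index('all', columns=cols_to_index, optlevel=9, kind='full')

The point is to pay the indexing cost once at the end, instead of on every append.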

Upvotes: 3
