Reputation: 8996
I read a random subset of rows from an HDF table with the following Python/Pandas code:
hdf_store = pd.HDFStore('path_to_data.h5')
total_rows = hdf_store.get_storer('hdf_table_name').nrows
num_rows = int(total_rows * .25)
row_indices = np.random.randint(0,rows_indices,size=num_rows)
my_df = pd.read_hdf(hdf_store, 'hdf_table_name', where=pd.Index(row_indices))
Later in the program, I would like to pull the rest of the rows of data from the HDF5 table. But the following threw errors:
rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
where=pd.Index(not in (row_indices)))
rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
where=not pd.Index(row_indices))
Is there a way to pull HDF rows by records not in a list of indices?
Because the table is larger than my RAM, I would like to avoid pulling all the rows from HDF upfront (even in chunks) and then splitting it to hold both tables at once. I could map the indices to another column, and subset on rows not in the mapped value of that column. But that would presumably be a lot slower than querying on the index directly.
Upvotes: 1
Views: 366
Reputation: 210912
You can use Index.difference method.
Demo:
# randomly select 25% of index elements (without duplicates `replace=False`)
sample_idx = np.random.choice(np.arange(total_rows), total_rows//4, replace=False)
# select remaining index elements
rest_idx = pd.Index(np.arange(total_rows)).difference(sample_idx)
# get rest rows by index
rest = store.select('hdf_table_name', where=rest_idx)
PS optionally you may want to select rest of rows in chunks...
Upvotes: 1