Reputation: 125
I've got a giant HDF5 file consisting of one table: 26 columns, about 3 billion rows (no way it's going to fit in memory). I did a lot of Googling and couldn't find a fast way to query distinct values for a column or group of columns. Is there a way that's faster than iterating through all rows and building lists?
Upvotes: 1
Views: 1059
Reputation: 8091
This shows how to extract a column of data from a PyTables Table into a NumPy array, then use np.unique() to get a new array of the unique values only. The option to also get the counts of each unique value is shown as well.
import numpy as np
import tables as tb

# open the file and get a handle on the table
h5_file = tb.open_file('your_file.h5', mode='r')
mytable = h5_file.root.YOUR_DATASET

# extract one column as a NumPy array
Col1_array = mytable.col('Col1')
# the statement above is equivalent to:
Col1_array = mytable.read(field='Col1')

# get the array of unique values:
uarray = np.unique(Col1_array)
# if you also want an array of counts for each unique value:
uarray, carray = np.unique(Col1_array, return_counts=True)
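
Since even a single column of 3 billion rows may be too large to hold at once, Table.read() also accepts start/stop arguments, so the column can be scanned one slice at a time and the running set of uniques merged with np.union1d(). Below is a minimal sketch of that chunked variant, assuming the same placeholder file and dataset names as above; the chunk size is an illustrative choice:

import numpy as np
import tables as tb

CHUNK = 10_000_000  # rows per slice; tune to available memory

with tb.open_file('your_file.h5', mode='r') as h5_file:
    mytable = h5_file.root.YOUR_DATASET
    uniques = None
    for start in range(0, mytable.nrows, CHUNK):
        # only one slice of one column is in memory at a time
        stop = min(start + CHUNK, mytable.nrows)
        chunk_uniques = np.unique(mytable.read(start=start, stop=stop, field='Col1'))
        uniques = chunk_uniques if uniques is None else np.union1d(uniques, chunk_uniques)

print(uniques)

Note the chunked form only merges the values; carrying per-value counts across chunks would need a separate accumulator (e.g. a dict keyed by value). For distinct combinations of a group of columns, np.unique(..., axis=0) on the stacked column arrays works the same way.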
Upvotes: 1