dts

Reputation: 125

Is there a fast way to query distinct column values in a huge hdf5 table with pytables?

I've got a giant-sized hdf5 file consisting of one table, 26 columns, about 3 billion rows (no way it's going to fit in memory). I did a lot of Googling and couldn't find a fast way to query distinct values for a column or group of columns. Is there a way that's faster than iterating through all rows and building lists?

Upvotes: 1

Views: 1059

Answers (1)

kcw78

Reputation: 8091

This shows how to extract a column of data from a PyTables Table into a NumPy array, then use NumPy's np.unique() function to get a new array containing only the unique values. An option to also get the count of each unique value is shown.

import numpy as np
import tables

# h5_file is an already opened tables.File object
mytable = h5_file.root.YOUR_DATASET

# read one column into a NumPy array
Col1_array = mytable.col('Col1')
# above statement is equivalent to:
Col1_array = mytable.read(field='Col1')

# get array of unique values:
uarray = np.unique(Col1_array)

# if you also want an array of counts for each unique value:
uarray, carray = np.unique(Col1_array, return_counts=True)
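Note that with ~3 billion rows, reading the whole column at once may itself exceed memory. As a sketch (the function name and chunk size here are my own, not part of your schema), you can read the column in slices and accumulate the unique values incrementally, since the union of per-slice uniques equals the uniques of the whole column:

```python
import numpy as np
import tables

def unique_column(table, colname, chunk=10_000_000):
    """Return the unique values of one column of a PyTables Table,
    reading the column in slices so the full column never has to
    fit in memory at once."""
    uniques = np.array([], dtype=table.coldtypes[colname])
    for start in range(0, table.nrows, chunk):
        # Table.read() clamps stop to nrows, so the last partial
        # slice is handled automatically
        part = table.read(start=start, stop=start + chunk, field=colname)
        # merge this slice's uniques into the running set of uniques
        uniques = np.unique(np.concatenate((uniques, np.unique(part))))
    return uniques
```

Memory use is then bounded by the chunk size plus the number of distinct values, rather than by the total row count.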

Upvotes: 1
