Reputation: 125
I've got a giant HDF5 file consisting of one table: 26 columns, about 3 billion rows (no way it's going to fit in memory). I did a lot of Googling and couldn't find a fast way to query distinct values for a column or group of columns. Is there a way that's faster than iterating through all rows and building lists?
Upvotes: 1
Views: 1059
Reputation: 8091
This shows how to extract a column of data from a PyTables Table into a NumPy array, then use np.unique() to get a new array of the unique values only. The option to also get the counts of each unique value is shown as well.
import numpy as np
import tables as tb

# open the file and get a handle on the table
h5_file = tb.open_file('your_file.h5', mode='r')
mytable = h5_file.root.YOUR_DATASET

# extract one column as a NumPy array
Col1_array = mytable.col('Col1')
# the statement above is equivalent to:
Col1_array = mytable.read(field='Col1')

# get the array of unique values:
uarray = np.unique(Col1_array)
# if you also want an array of counts for each unique value:
uarray, carray = np.unique(Col1_array, return_counts=True)
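
Since even a single column of 3 billion rows may be too large to hold at once, Table.read() also accepts start/stop arguments, so the column can be scanned one slice at a time and the running set of uniques merged with np.union1d(). Below is a minimal sketch of that chunked variant, assuming the same placeholder file and dataset names as above; the chunk size is an illustrative choice:

import numpy as np
import tables as tb

CHUNK = 10_000_000  # rows per slice; tune to available memory

with tb.open_file('your_file.h5', mode='r') as h5_file:
    mytable = h5_file.root.YOUR_DATASET
    uniques = None
    for start in range(0, mytable.nrows, CHUNK):
        # only one slice of one column is in memory at a time
        stop = min(start + CHUNK, mytable.nrows)
        chunk_uniques = np.unique(mytable.read(start=start, stop=stop, field='Col1'))
        uniques = chunk_uniques if uniques is None else np.union1d(uniques, chunk_uniques)

print(uniques)

Note the chunked form only merges the values; carrying per-value counts across chunks would need a separate accumulator (e.g. a dict keyed by value). For distinct combinations of a group of columns, np.unique(..., axis=0) on the stacked column arrays works the same way.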
Upvotes: 1