Reputation: 2699
I have a script that collects data from an experiment and adds it to a PyTables table. The script gets data in batches (say, groups of 10). It's a little cumbersome in the code to add one row at a time via the normal method, e.g.:
data_batch = experiment.read()
last_time = time.time()
for data_row in data_batch:
    row = table.row
    row['timestamp'] = last_time
    last_time += dt
    row['column1'] = data_row[0]
    row['column2'] = data_row[1]
    row.append()
table.flush()
I would much rather do something like this:
data_batch = experiment.read()
start_index = len(table)
num_rows = len(data_batch)
table.append_n_rows(num_rows)
table.cols.timestamp[start_index:] = last_time + np.arange(num_rows) * dt
last_time += dt * num_rows
table.cols.column1[start_index:] = data_batch[:, 0]
table.cols.column2[start_index:] = data_batch[:, 1]
table.flush()
Does anyone know if there is some function that does the table.append_n_rows step? Right now, all I can do is [table.row for i in range(num_rows)], which I feel is hacky and inefficient.
Upvotes: 0
Views: 917
Reputation: 8046
You are on the right track. In table.append(rows), the rows argument can be any object that can be converted to a structured array. This includes: "NumPy structured arrays, lists of tuples or array records, and a string or Python buffer". (I prefer NumPy arrays because I routinely work with them. Your answer shows how to use a list of tuples.)
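For example, here is a minimal sketch of the structured-array route, reusing the variable names from your question (data_batch, last_time, dt, table); the 'f8' dtypes are an assumption, so match them to your actual table description:

import numpy as np

# Build one structured array for the whole batch, then append it in a
# single call. Field names and dtypes must match the table description.
num_rows = len(data_batch)
batch = np.empty(num_rows, dtype=[('timestamp', 'f8'),
                                  ('column1', 'f8'),
                                  ('column2', 'f8')])
batch['timestamp'] = last_time + np.arange(num_rows) * dt
batch['column1'] = data_batch[:, 0]
batch['column2'] = data_batch[:, 1]
last_time += dt * num_rows
table.append(batch)
table.flush()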
There is a significant performance advantage to adding data in batches instead of one row at a time. I ran some tests and posted them to SO a few years ago. I/O performance is primarily related to the number of batches, not the batch size. Take a look at this answer for details: pytables writes much faster than h5py
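If you want to see the effect yourself, a rough micro-benchmark along these lines (a sketch; the file name, row count, and batch size are arbitrary placeholders) compares the two write patterns:

import time
import numpy as np
import tables as tb

# Sketch: write the same 100,000 rows one row at a time and in batches
# of 10,000, and compare wall-clock times.
desc = {'timestamp': tb.Float64Col(pos=0),
        'column1': tb.Float64Col(pos=1),
        'column2': tb.Float64Col(pos=2)}
data = np.zeros(100_000, dtype=[('timestamp', 'f8'),
                                ('column1', 'f8'),
                                ('column2', 'f8')])

with tb.open_file('bench.h5', mode='w') as f:
    t1 = f.create_table(f.root, 'one_at_a_time', desc)
    start = time.time()
    row = t1.row
    for rec in data:
        row['timestamp'] = rec['timestamp']
        row['column1'] = rec['column1']
        row['column2'] = rec['column2']
        row.append()
    t1.flush()
    print('row-by-row:', time.time() - start)

    t2 = f.create_table(f.root, 'batched', desc)
    start = time.time()
    for i in range(0, len(data), 10_000):
        t2.append(data[i:i + 10_000])
    t2.flush()
    print('batched:   ', time.time() - start)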
Also, if you are going to create a large table, consider setting the expectedrows parameter when you create the table. This will also improve I/O performance, and it has the side benefit of setting an appropriate chunkshape.
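For example, a minimal sketch (the file name, row count, and three-column description are placeholders standing in for your real table):

import tables as tb

# Sketch: expectedrows is a hint about the final table size; PyTables
# uses it to pick an appropriate chunkshape when the table is created.
desc = {'timestamp': tb.Float64Col(pos=0),
        'column1': tb.Float64Col(pos=1),
        'column2': tb.Float64Col(pos=2)}
with tb.open_file('data.h5', mode='w') as fileh:
    table = fileh.create_table(fileh.root, 'data', desc,
                               "experiment data",
                               expectedrows=1_000_000)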
Recommended approach with your data:
data_batch = experiment.read()
last_time = time.time()
row_list = []
for data_row in data_batch:
    row_list.append((last_time, data_row[0], data_row[1]))
    last_time += dt
your_table.append(row_list)
your_table.flush()
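Note that when appending a list of tuples like this, the values map to columns by position, so the tuple order has to match the column order in your table description (timestamp, column1, column2 here).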
Upvotes: 1
Reputation: 2699
There is an example in the source code. I'm going to paste it here to avoid a dead link in the future.
import tables as tb

class Particle(tb.IsDescription):
    name = tb.StringCol(16, pos=1)    # 16-character string
    lati = tb.IntCol(pos=2)           # integer
    longi = tb.IntCol(pos=3)          # integer
    pressure = tb.Float32Col(pos=4)   # float (single-precision)
    temperature = tb.FloatCol(pos=5)  # double (double-precision)

fileh = tb.open_file('test4.h5', mode='w')
table = fileh.create_table(fileh.root, 'table', Particle, "A table")

# Append several rows in only one call
table.append([("Particle: 10", 10, 0, 10 * 10, 10**2),
              ("Particle: 11", 11, -1, 11 * 11, 11**2),
              ("Particle: 12", 12, -2, 12 * 12, 12**2)])

fileh.close()
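As a quick sanity check (not part of the original example), you can read the appended rows back like this:

import tables as tb

# Sketch: reopen the file and print the rows appended above.
with tb.open_file('test4.h5', mode='r') as fileh:
    for r in fileh.root.table.iterrows():
        print(r['name'], r['lati'], r['longi'],
              r['pressure'], r['temperature'])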
Upvotes: 0