Jonathan Wheeler

Reputation: 2699

Adding rows in bulk to PyTables array

I have a script that collects data from an experiment and adds it to a PyTables table. The script gets data in batches (say, groups of 10 readings). Adding one row at a time via the normal method is a little cumbersome in the code, e.g.:

data_batch = experiment.read()
last_time = time.time()
row = table.row            # Row accessor for one-at-a-time appends
for data_row in data_batch:
    row['timestamp'] = last_time
    last_time += dt
    row['column1'] = data_row[0]
    row['column2'] = data_row[1]
    row.append()
table.flush()

I would much rather do something like this:

data_batch = experiment.read()
start_index = len(table)
num_rows = len(data_batch)

table.append_n_rows(num_rows)
table.cols.timestamp[start_index:] = last_time + np.arange(num_rows) * dt
last_time += dt * num_rows
table.cols.column1[start_index:] = data_batch[:, 0]
table.cols.column2[start_index:] = data_batch[:, 1]
table.flush()

Does anyone know if there is some function that does what table.append_n_rows would do? Right now, all I can think of is [table.row for i in range(num_rows)], which feels hacky and inefficient.

Upvotes: 0

Views: 917

Answers (2)

kcw78

Reputation: 8046

You are on the right track. In table.append(rows), the rows argument can be any object that can be converted to a structured array, including "NumPy structured arrays, lists of tuples or array records, and a string or Python buffer". (I prefer NumPy structured arrays because I routinely work with them; your own answer shows how to use a list of tuples.)
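
For reference, here is a minimal sketch (mine, not part of the original answer) of the structured-array variant. It reuses the question's experiment, table, last_time and dt objects and assumes all three columns are 64-bit floats:

import numpy as np

data_batch = experiment.read()   # assumed to return an (N, 2) array
num_rows = len(data_batch)

# Build one structured array whose field names match the table's columns.
batch = np.empty(num_rows, dtype=[('timestamp', 'f8'),
                                  ('column1', 'f8'),
                                  ('column2', 'f8')])
batch['timestamp'] = last_time + np.arange(num_rows) * dt
batch['column1'] = data_batch[:, 0]
batch['column2'] = data_batch[:, 1]
last_time += num_rows * dt

table.append(batch)   # one call instead of a per-row loop
table.flush()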

There is a significant performance advantage to adding data in batches instead of one row at a time. I ran some tests and posted the results to SO a few years ago. I/O performance is primarily related to the number of write calls (batches), not to the batch size. Take a look at this answer for details: pytables writes much faster than h5py
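
As a rough illustration, here is a self-contained timing sketch (my own, not taken from the linked answer) that writes the same 100,000 rows one at a time and then in a single append call; the column description, file name, and dt value are assumptions:

import time
import numpy as np
import tables as tb

n_rows = 100_000
data = np.random.random((n_rows, 2))
dt = 0.001

# Same three columns as the question; pos fixes the field order.
description = {'timestamp': tb.Float64Col(pos=0),
               'column1':   tb.Float64Col(pos=1),
               'column2':   tb.Float64Col(pos=2)}

with tb.open_file('timing_test.h5', mode='w') as h5f:
    # 1) one row at a time through the Row accessor
    t1 = h5f.create_table(h5f.root, 'row_by_row', description)
    start = time.perf_counter()
    row = t1.row
    for i in range(n_rows):
        row['timestamp'] = i * dt
        row['column1'] = data[i, 0]
        row['column2'] = data[i, 1]
        row.append()
    t1.flush()
    print('row-by-row:', time.perf_counter() - start)

    # 2) one append() call with a list of tuples
    t2 = h5f.create_table(h5f.root, 'batched', description)
    start = time.perf_counter()
    rows = [(i * dt, data[i, 0], data[i, 1]) for i in range(n_rows)]
    t2.append(rows)
    t2.flush()
    print('batched   :', time.perf_counter() - start)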

Also, if you are going to create a large table, consider setting the expectedrows parameter when you create it. PyTables uses that estimate to choose an appropriate chunk size, which also improves I/O performance.
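
For example, a hedged sketch of passing expectedrows at creation time (the Reading description, file name, and the one-million-row estimate are assumptions, not from the answer):

import tables as tb

class Reading(tb.IsDescription):
    timestamp = tb.Float64Col()
    column1   = tb.Float64Col()
    column2   = tb.Float64Col()

with tb.open_file('experiment.h5', mode='w') as h5f:
    table = h5f.create_table(h5f.root, 'readings', Reading,
                             title='Experiment data',
                             expectedrows=1_000_000)  # lets PyTables pick a good chunk size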

Recommended approach with your data:

data_batch = experiment.read()
last_time = time.time()

# Build one row tuple per reading, in the same field order as the table.
row_list = []
for data_row in data_batch:
    row_list.append((last_time, data_row[0], data_row[1]))
    last_time += dt

# Append the whole batch in a single call, then flush to disk.
your_table.append(row_list)
your_table.flush()

Upvotes: 1

Jonathan Wheeler

Reputation: 2699

There is an example in the source code

I'm going to paste it here to avoid a dead link in the future.

import tables as tb

class Particle(tb.IsDescription):
    name        = tb.StringCol(16, pos=1)  # 16-character string
    lati        = tb.IntCol(pos=2)         # integer
    longi       = tb.IntCol(pos=3)         # integer
    pressure    = tb.Float32Col(pos=4)     # float (single-precision)
    temperature = tb.FloatCol(pos=5)       # double (double-precision)

fileh = tb.open_file('test4.h5', mode='w')
table = fileh.create_table(fileh.root, 'table', Particle,
                           "A table")

# Append several rows in only one call
table.append([("Particle:     10", 10, 0, 10 * 10, 10**2),
              ("Particle:     11", 11, -1, 11 * 11, 11**2),
              ("Particle:     12", 12, -2, 12 * 12, 12**2)])
fileh.close()

Upvotes: 0
