jsnider

Reputation: 293

"Reading in" large text file into hdf5 via PyTables or PyHDF?

I'm attempting some statistics using SciPy, but my input dataset is quite large (~1.9 GB) and in dbf format. The file is large enough that NumPy raises a MemoryError when I try to create an array with genfromtxt. (I've got 3 GB of RAM, but I'm running 32-bit Windows.)

i.e.:

Traceback (most recent call last):

  File "<pyshell#5>", line 1, in <module>
    ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))

File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):

MemoryError

From other posts, I see that the chunked arrays provided by PyTables could be useful, but my problem is reading in this data in the first place. In other words, PyTables or PyHDF can easily create the HDF5 output that I want, but what should I do to get my data into an array first?

For instance:

import numpy, scipy, tables

h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode = "w", title = "Diversity Index Results")

group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")

and then I could create either a table or an array, but how do I refer back to the original dbf data? In the description?

Thanks for any thoughts you might have!

Upvotes: 3

Views: 3975

Answers (2)

Ethan Furman

Reputation: 69041

If the data is in a dbf file, you might try my dbf package -- it only keeps the records in memory that are being accessed, so you should be able to cycle through the records pulling out the data that you need:

import dbf

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")

# running totals, one slot per column of the dbf file
sums = [0, 0, 0, 0.0, 0.0, 0]

for record in table:
    for index in range(len(sums)):
        sums[index] += record[index]
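
If the goal is to end up with the data in HDF5 rather than just running totals, the same record-by-record loop can feed a PyTables EArray, which is extendable along its first dimension so the whole dataset never has to sit in memory. This is only a rough sketch, reusing the paths from the question and assuming six numeric columns:

import dbf, tables

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")

h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode="w", title="Diversity Index Results")

group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")

# an EArray grows along its first dimension, so rows can be appended one at a time
earray = h5file.createEArray(group, "data", tables.Float64Atom(), shape=(0, 6))

for record in table:
    earray.append([[record[index] for index in range(6)]])

h5file.close()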

Upvotes: 0

DaveP

Reputation: 7102

If the data is too big to fit in memory, you can work with a memory-mapped file (it behaves like a numpy array but is stored on disk - see the numpy.memmap docs), though you may be able to get similar results using HDF5, depending on what operations you need to perform on the array. Obviously this will make many operations slower, but that is better than not being able to do them at all.

Because you are hitting a memory limit, I don't think you can use genfromtxt. Instead, you should iterate through your text file one line at a time and write the data to the relevant position in the memmap/HDF5 object, as in the sketch below.
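
A minimal sketch of that approach with numpy.memmap, assuming the data has been exported to a plain-text file with a one-line header and six whitespace-separated numeric columns, and that the number of rows is known in advance (the filenames and counts here are made up for illustration):

import numpy as np

n_rows = 1000000    # assumed: must be known, or counted in a cheap first pass
n_cols = 6          # assumed: six numeric columns, as in the question's dtype

# on-disk array; indexing touches only the pages that are actually read or written
data = np.memmap("IND_SUM.dat", dtype="float64", mode="w+", shape=(n_rows, n_cols))

with open("IND_SUM.txt") as fh:
    next(fh)    # skip the header line
    for i, line in enumerate(fh):
        data[i, :] = [float(v) for v in line.split()]

data.flush()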

It is not clear what you mean by "referring back to the original dbf data". Obviously you can just store the filename it came from somewhere; HDF5 objects have "attributes" which are designed to store exactly this kind of metadata.
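
For instance, with the PyTables group from the question, the source path could be recorded like this (just a sketch; "source_file" is an arbitrary attribute name, not anything the library requires):

# assuming the h5file and group objects from the question are still open:
# record where the data came from as an HDF5 attribute on the group
group._v_attrs.source_file = r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf"
h5file.flush()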

Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, though this is largely a matter of preference.
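
As a rough h5py sketch of what the snippet in the question does with pytables (the paths come from the question; the dataset name, shape and dtype are assumptions):

import h5py

h5file = h5py.File(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", "w")

group = h5file.create_group("IND_SUM")    # groups behave like dictionaries

# pre-allocate an on-disk dataset; rows can then be written incrementally,
# e.g. dset[i, :] = row, without holding the whole array in memory
dset = group.create_dataset("data", shape=(1000000, 6), dtype="float64")  # assumed shape

h5file.close()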

Upvotes: 4
