Reputation: 5748
A simple problem, but maybe a tricky answer:
The problem is how to handle a huge .txt file with PyTables.
I have a big .txt file with MILLIONS of short lines, for example:
line 1 23458739
line 2 47395736
...........
...........
The content of this .txt file must be saved into a PyTables table, which is easy: nothing else needs to be done with the data, just copy it over. That gives a table with, for example, 10 columns and millions of rows.
The problem comes up when, besides the 10 columns x millions of rows generated directly from the .txt file, new columns must be created in the table depending on the data on each line. So how can this be handled efficiently?
Solution 1: first copy the whole text file, line by line, into the table (millions of rows), then iterate over each row of the table (millions again) and, depending on the values, generate the new columns needed.
Solution 2: read the .txt file line by line, do whatever is needed, calculate the new values, and then send all the info to the table.
Solution 3: any other more efficient and faster solution?
Upvotes: 1
Views: 444
Reputation: 3637
I think the basic problem here is one of the conceptual model. PyTables' Tables only handle regular (or structured) data. However, the data that you have is irregular or unstructured, in that the structure is determined as you read the data. Said another way, PyTables needs the column description to be known completely by the time create_table() is called. There is no way around this.
Since, in your problem statement, any line may add a new column, you have no choice but to do this in two full passes through the data: (1) read through the data and determine the columns, and (2) write the data to the table. In pseudocode:
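    import tables as tb

    cols = {}
    with open('data.txt') as d:
        # first pass: discover the columns
        for line in d:
            # columns_in() is a placeholder for however a line names its columns
            for col in columns_in(line):
                if col not in cols:
                    # Int64Col is an assumed type; use whatever fits the data
                    cols[col] = tb.Int64Col(pos=len(cols))
        # second pass: rewind the file and write the table
        d.seek(0)
        f = tb.open_file(...)
        t = f.create_table(..., description=cols)
        for line in d:
            row = line_to_row(line)  # placeholder: returns one tuple in column order
            t.append([row])          # append() expects a sequence of rows
        f.close()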
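To make the two passes concrete, here is a minimal runnable sketch. It assumes a line format that the question does not specify: whitespace-separated name=value tokens with integer values, so that a later line can introduce a column earlier lines lacked. The function names, the "data" node name, the Int64 column type, the batch size, and the fill value of 0 for missing fields are all assumptions made for illustration, not part of the original problem.

```python
def discover_columns(path):
    """First pass: collect every column name, in order of first appearance."""
    cols = []
    seen = set()
    with open(path) as src:
        for line in src:
            for token in line.split():
                name = token.split("=", 1)[0]
                if name not in seen:
                    seen.add(name)
                    cols.append(name)
    return cols


def line_to_row(line, cols):
    """Turn one line into a tuple ordered like cols; missing fields become 0."""
    values = dict(token.split("=", 1) for token in line.split())
    return tuple(int(values.get(name, 0)) for name in cols)


def write_table(txt_path, h5_path, chunk=100_000):
    """Second pass: create the table from the discovered columns and fill it."""
    # Imported here so the parsing helpers above work without PyTables installed.
    import tables as tb

    cols = discover_columns(txt_path)
    # The description has to be complete before create_table() is called.
    description = {name: tb.Int64Col(pos=i) for i, name in enumerate(cols)}
    with open(txt_path) as src, tb.open_file(h5_path, mode="w") as h5:
        table = h5.create_table("/", "data", description)
        buffer = []
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            buffer.append(line_to_row(line, cols))
            if len(buffer) >= chunk:
                table.append(buffer)  # append in batches, not row by row
                buffer.clear()
        if buffer:
            table.append(buffer)
        table.flush()
```

Calling write_table("data.txt", "data.h5") would read the text file twice and write one table. Appending in batches rather than one row at a time is a design choice to keep the per-call overhead low over millions of lines.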
Obviously, if you knew the table structure ahead of time you could skip the first loop and this would be much faster.
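If the extra columns are fixed in advance and only their values depend on each line, a single pass along the lines of the question's Solution 2 is enough. The sketch below assumes each line holds one integer and invents two derived columns (is_even and n_digits) purely as stand-ins for whatever needs to be computed. The function names, node name, column types, and batch size are likewise assumptions.

```python
def derive_row(line):
    """Parse one line and compute the extra (hypothetical) columns from it."""
    value = int(line)
    return (value, value % 2 == 0, len(str(abs(value))))


def batches(lines, size):
    """Yield lists of at most `size` derived rows, skipping blank lines."""
    batch = []
    for line in lines:
        if not line.strip():
            continue
        batch.append(derive_row(line))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def convert(txt_path, h5_path, size=100_000):
    """Single pass: the full description is known before any line is read."""
    # Imported here so the helpers above work without PyTables installed.
    import tables as tb

    description = {
        "value": tb.Int64Col(pos=0),
        "is_even": tb.BoolCol(pos=1),
        "n_digits": tb.Int32Col(pos=2),
    }
    with open(txt_path) as src, tb.open_file(h5_path, mode="w") as h5:
        table = h5.create_table("/", "data", description)
        for batch in batches(src, size):
            table.append(batch)  # one append per batch of rows
        table.flush()
```

This avoids both the second read of the text file and the second iteration over the table that Solution 1 would need.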
Upvotes: 1