Reputation: 288
So I am attempting to read in a large data file in Python. If the data had one column and 1 million rows, I would do:
fp = open(ifile, 'r')
for row in fp:
    # process row
My problem arises when the data I am reading in has, say, 1 million columns and only 1 row. What I would like is functionality similar to the fscanf() function in C.
Namely,
while not EOF:
    part_row = read_next(%lf)
    work on part_row
I could use fp.read(%lf) if I knew that the format was a long float or whatever.
Any thoughts?
Upvotes: 1
Views: 1591
Reputation: 365777
There are two basic ways to approach this:
First, you can write a read_column function with its own explicit buffer, either as a generator function:
def column_reader(fp):
    buf = ''
    while True:
        # Split off the next column; keep refilling the buffer until a comma shows up.
        col_and_buf = buf.split(',', 1)
        while len(col_and_buf) == 1:
            buf += fp.read(4096)
            col_and_buf = buf.split(',', 1)
        col, buf = col_and_buf
        yield col
… or as a class:
class ColumnReader(object):
    def __init__(self, fp):
        self.fp, self.buf = fp, ''

    def next(self):
        col_and_buf = self.buf.split(',', 1)
        while len(col_and_buf) == 1:
            self.buf += self.fp.read(4096)
            col_and_buf = self.buf.split(',', 1)
        col, self.buf = col_and_buf
        return col
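As a quick illustration of how that might be used (just a sketch; data.csv and the float conversion stand in for whatever your file and per-column processing actually are):

with open('data.csv') as fp:              # hypothetical file name
    reader = ColumnReader(fp)
    first_value = float(reader.next())    # pull and convert one column at a time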
But, if you write a read_until function that handles the buffering internally, then you can just do this:
next_col = read_until(fp, ',')[:-1]
There are multiple read_until recipes on ActiveState.
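For illustration, a minimal sketch of such a helper (not any particular ActiveState recipe; it reads one character at a time for simplicity, and returns the delimiter along with the text, which is why the [:-1] above strips it):

def read_until(fp, delimiter):
    # Read characters until the delimiter (or EOF); return everything read,
    # including the delimiter if one was found.
    chunks = []
    while True:
        ch = fp.read(1)
        if not ch:            # EOF
            break
        chunks.append(ch)
        if ch == delimiter:
            break
    return ''.join(chunks)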
Or, if you mmap the file, you effectively get this for free. You can just treat the file as a huge string and use find (or regular expressions) on it. (This assumes the entire file fits within your virtual address space; that's probably not a problem in 64-bit Python builds, but in 32-bit builds it can be.)
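A rough sketch of what the mmap-and-find approach could look like, assuming a comma-separated file (data.csv is just a placeholder name):

import mmap

with open('data.csv', 'rb') as fp:               # hypothetical file name
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    start = 0
    while True:
        end = mm.find(',', start)                # next delimiter, or -1 at the end
        if end == -1:
            tail = mm[start:]                    # whatever follows the last comma
            break
        col = mm[start:end]
        # ... work on col ...
        start = end + 1
    mm.close()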
Obviously these are incomplete. They don't handle EOF or newlines (in real life you probably have six rows of a million columns, not one, right?), etc. But this should be enough to show the idea.
Upvotes: 1
Reputation: 142176
A million floats in text format really isn't that big... So unless it's proving a bottleneck of some sort, I wouldn't worry about it and would just do:
with open('file') as fin:
    my_data = [process_line(word) for word in fin.read().split()]
A possible alternative (assuming space-delimited "words") is something like:
import mmap, re

with open('whatever.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for word in re.finditer(r'(.*?)\s', mf):
        print word.group(1)
And that'll scan the entire file and effectively give a massive word stream, regardless of rows / columns.
Upvotes: 3
Reputation: 4538
You can accomplish this using yield.
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('your_file.txt')
for piece in read_in_chunks(f):
    process_data(piece)
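One thing to keep in mind (an aside, not part of the recipe above): a fixed-size chunk can cut a value in half, so if you want complete comma-separated tokens you'd typically carry the trailing partial token over into the next chunk. A rough sketch, with illustrative names:

def read_values(file_object, chunk_size=1024, sep=','):
    # Yield complete separator-delimited tokens, carrying any partial
    # token at the end of one chunk over into the next.
    leftover = ''
    while True:
        data = file_object.read(chunk_size)
        if not data:
            if leftover:
                yield leftover      # final token with no trailing separator
            break
        parts = (leftover + data).split(sep)
        leftover = parts.pop()      # possibly incomplete; keep for next round
        for part in parts:
            yield part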
Take a look at this question for more examples.
Upvotes: 0