user1462620

Reputation: 288

Read in a large data file in Python

So I am attempting to read in a large data file in Python. If the data had one column and 1 million rows, I would do:

fp = open(ifile, 'r')

for row in fp:  
    process row

My problem arises when the data I am reading in has, say, 1 million columns and only 1 row. What I would like is functionality similar to the fscanf() function in C.

Namely,

while not EOF:  
    part_row = read_next(%lf)  
    work on part_row

I could use fp.read(%lf), if I knew that the format was long float or whatever.

Any thoughts?

Upvotes: 1

Views: 1591

Answers (3)

abarnert

Reputation: 365777

There are two basic ways to approach this:

First, you can write a column-reading function with its own explicit buffer, either as a generator function:

def column_reader(fp):
    buf = ''
    while True:
        col_and_buf = buf.split(',', 1)
        while len(col_and_buf) == 1:
            # No complete column in the buffer yet; read more.
            buf += fp.read(4096)
            col_and_buf = buf.split(',', 1)
        col, buf = col_and_buf
        yield col

… or as a class:

class ColumnReader(object):
    def __init__(self, fp):
        self.fp, self.buf = fp, ''
    def next(self):
        col_and_buf = self.buf.split(',', 1)
        while len(col_and_buf) == 1:
            # No complete column in the buffer yet; read more.
            self.buf += self.fp.read(4096)
            col_and_buf = self.buf.split(',', 1)
        col, self.buf = col_and_buf
        return col

But, if you write a read_until function that handles the buffering internally, then you can just do this:

next_col = read_until(fp, ',')[:-1]

There are multiple read_until recipes on ActiveState.
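
For illustration, here is one minimal way such a read_until could look (a hand-rolled sketch, not one of the ActiveState recipes; it reads one character at a time and leans on the file object's own buffering):

def read_until(fp, delim):
    # Read characters until delim (inclusive) or EOF; at EOF the result
    # may not end with delim, so a caller stripping the delimiter with
    # [:-1] should check for that case.
    pieces = []
    while True:
        ch = fp.read(1)
        if not ch:          # EOF
            break
        pieces.append(ch)
        if ch == delim:
            break
    return ''.join(pieces)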

Or, if you mmap the file, you effectively get this for free. You can just treat the file as a huge string and use find (or regular expressions) on it. (This assumes the entire file fits within your virtual address space—probably not a problem in 64-bit Python builds, but in 32-bit builds, it can be.)
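
A rough sketch of that mmap-and-find approach (the file name and the comma delimiter are assumptions for illustration, written Python-2 style to match the rest of this page):

import mmap

with open('data.txt', 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    pos = 0
    while pos < len(mm):
        idx = mm.find(',', pos)   # next delimiter, or -1 after the last column
        if idx == -1:
            idx = len(mm)
        col = mm[pos:idx]
        # work on col here
        pos = idx + 1
    mm.close()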


Obviously these are incomplete. They don't handle EOF or newlines (in real life you probably have six rows of a million columns, not one, right?), etc. But this should be enough to show the idea.

Upvotes: 1

Jon Clements

Reputation: 142176

A million floats in text format really isn't that big... So unless it's proving a bottleneck of some sort, I wouldn't worry about it and would just do:

with open('file') as fin:
    my_data = [process_line(word) for word in fin.read().split()]

A possible alternative (assuming space-delimited "words") is something like:

import mmap, re

with open('whatever.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for word in re.finditer(r'(.*?)\s', mf):
        print word.group(1)

And that'll scan the entire file and effectively give a massive word stream, regardless of rows / columns.

Upvotes: 3

chirinosky

Reputation: 4538

You can accomplish this using yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazily read a file piece by piece in fixed-size chunks."""
    while True:
        data = file_object.read(chunk_size)
        if not data:  # end of file
            break
        yield data


f = open('your_file.txt')
for piece in read_in_chunks(f):
    process_data(piece)

Take a look at this question for more examples.

Upvotes: 0
