Rick

Reputation: 45291

Get data out of a file without iterating through it multiple times

I've created the following function to pull data out of a file. It works ok, but gets very slow for larger files.

def is_float(f):
    # Data rows start with a number; index/header lines do not.
    try:
        float(str(f))
    except ValueError:
        return False
    else:
        return True

# is_float is defined first so it can be used as the default sieve below.
def get_data(file, indexes, data_start, sieve_first=is_float):
    file_list = list(file)
    for i in indexes:
        d_line = i + data_start
        for line in file_list[d_line:]:
            if sieve_first(line.strip().split(',')[0]):
                yield file_list[d_line].strip()
                d_line += 1
            else:
                break

with open('my_data') as f:
    data = get_data(f, index_list, 3)

The file might look like this (line numbers added for clarity):

line 1234567: # <-- INDEX
line 1234568: # +1
line 1234569: # +2
line 1234570:      8, 17.0, 23, 6487.6
line 1234571:      8, 17.0, 23, 6487.6
line 1234572:      8, 17.0, 23, 6487.6
line 1234573:
line 1234574:
line 1234575:

With the above example, lines 1234570 through 1234572 will be yielded.

Since my files are large, there are a couple things I don't like about my function.

  1. First is that it reads the entire file into memory; I do this so I can use line indexing in order to parse the data out.
  2. Second is that the same lines in the file are iterated over many times, which gets very expensive for a large file.

I have fiddled around trying to use iterators to get through the file a single time, but haven't been able to crack it. Any suggestions?

Upvotes: 4

Views: 97

Answers (2)

demented hedgehog

Reputation: 7548

Out of left field a bit, but if you have control over your files, you could move the data into an sqlite3 database.
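For instance, here's a minimal sketch of that idea, assuming the comma-separated layout shown in the question; the database file, table, and column names are made up for illustration:

import csv
import sqlite3

def looks_numeric(s):
    # Same idea as the question's is_float(): keep only the data rows.
    try:
        float(s)
        return True
    except ValueError:
        return False

# One-time import: load the comma-separated data rows into an indexed table.
conn = sqlite3.connect('my_data.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows '
             '(line_no INTEGER PRIMARY KEY, a REAL, b REAL, c REAL, d REAL)')

with open('my_data') as f:
    reader = csv.reader(f)
    data = ((n, *map(float, row)) for n, row in enumerate(reader, 1)
            if row and looks_numeric(row[0]))
    conn.executemany('INSERT OR REPLACE INTO rows VALUES (?, ?, ?, ?, ?)', data)
conn.commit()

# Later: pull any block of lines without re-reading the rest of the file.
wanted = conn.execute('SELECT a, b, c, d FROM rows WHERE line_no BETWEEN ? AND ?',
                      (1234570, 1234572)).fetchall()

After the one-time import, any range of lines can be fetched with an indexed query instead of rescanning the file.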

Also take a look at mmap and linecache. I imagine these last two are just wrappers around random-access files; that is, you could roll your own by scanning the file once, building an index->offset lookup table, and using seek.
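A rough sketch of the roll-your-own version, scanning the file once in binary mode so the byte offsets stay reliable (the helper names are just for illustration):

def build_offset_index(filename):
    # One pass over the file: remember the byte offset where each line starts.
    offsets = []
    pos = 0
    with open(filename, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(filename, offsets, line_no):
    # Jump straight to the wanted line with seek() instead of re-reading the file.
    with open(filename, 'rb') as f:
        f.seek(offsets[line_no])
        return f.readline().decode().rstrip('\n')

Building the index costs one full read, but every later lookup is a single seek.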

Some of these approaches assume you have some control over the files you're reading.

It also depends on whether you read a lot and write infrequently; if so, building an index is not such a bad idea.

Upvotes: 1

Steinar Lima

Reputation: 7821

If you only want a small portion of the file, I would use itertools.islice. It keeps nothing in memory beyond the lines you actually ask for.

Here's an example:

from itertools import islice

def yield_specific_lines_from_file(filename, start, stop):
    with open(filename, 'rb') as ifile:
        # islice skips past the first `start` lines without storing them,
        # then yields lines until `stop` is reached.
        for line in islice(ifile, start, stop):
            yield line

lines = list(yield_specific_lines_from_file('test.txt', 10, 20))

If you use Python 3.3 or newer, you can also simplify this by using the yield from statement:

from itertools import islice

def yield_specific_lines_from_file(filename, start, stop):
    with open(filename, 'rb') as ifile:
        yield from islice(ifile, start, stop)

lines = list(yield_specific_lines_from_file('test.txt', 10, 20))

This will not cache the lines you've already read from the file, though. If you want to do that, I suggest storing all read lines in a dictionary with the line number as key, and only pulling data from the file when it isn't already there.
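A quick sketch of that caching idea (the class name and details are just illustrative):

from itertools import islice

class LineCache:
    """Cache lines by line number; only hit the file for misses."""

    def __init__(self, filename):
        self.filename = filename
        self.cache = {}

    def get_lines(self, start, stop):
        missing = [n for n in range(start, stop) if n not in self.cache]
        if missing:
            with open(self.filename) as ifile:
                # Read up to the last missing line and keep everything we pass over.
                for n, line in enumerate(islice(ifile, 0, max(missing) + 1)):
                    self.cache.setdefault(n, line)
        return [self.cache[n] for n in range(start, stop)]

cache = LineCache('test.txt')
lines = cache.get_lines(10, 20)   # reads the file
lines = cache.get_lines(12, 15)   # served entirely from the cache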

Upvotes: 2
