Reputation: 45291
I've created the following function to pull data out of a file. It works ok, but gets very slow for larger files.
def is_float(f):
    # Defined before get_data, which uses it as a default argument value.
    try:
        float(f)
    except ValueError:
        return False
    else:
        return True

def get_data(file, indexes, data_start, sieve_first=is_float):
    file_list = list(file)
    for i in indexes:
        d_line = i + data_start
        for line in file_list[d_line:]:
            if sieve_first(line.strip().split(',')[0]):
                yield file_list[d_line].strip()
                d_line += 1
            else:
                break
with open('my_data') as f:
    data = get_data(f, index_list, 3)
The file might look like this (line numbers added for clarity):
line 1234567: # <-- INDEX
line 1234568: # +1
line 1234569: # +2
line 1234570: 8, 17.0, 23, 6487.6
line 1234571: 8, 17.0, 23, 6487.6
line 1234572: 8, 17.0, 23, 6487.6
line 1234573:
line 1234574:
line 1234575:
With the above example, lines 1234570 through 1234572 will be yielded.
Since my files are large, there are a couple of things I don't like about my function: list(file) reads the entire file into memory, and slicing file_list for each index means walking over the same data repeatedly.
I have fiddled around trying to use iterators to get through the file in a single pass, but haven't been able to crack it. Any suggestions?
Upvotes: 4
Views: 97
Reputation: 7548
Out of left field a bit, but if you have control over your files, you could move the data into an sqlite3 database.
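A minimal sketch of that idea, assuming the comma-separated four-column layout from the question (the table and column names here are made up for illustration):

```python
import sqlite3

# One-time load: parse the data lines into a table.
# Use a filename instead of ':memory:' for a persistent database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (a REAL, b REAL, c REAL, d REAL)')

# Stand-ins for rows parsed out of the text file.
rows = [(8, 17.0, 23, 6487.6), (8, 17.0, 23, 6487.6), (8, 17.0, 23, 6487.6)]
conn.executemany('INSERT INTO data VALUES (?, ?, ?, ?)', rows)
conn.commit()

# Retrieval becomes an indexed query instead of a linear scan of the file.
result = conn.execute('SELECT * FROM data').fetchall()
```

The one-time load cost is repaid on every subsequent lookup, which is why this suits a read-often, write-rarely workload.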
Also take a look at mmap and linecache. I imagine these last two are just wrappers around random-access files; i.e., you could roll your own by scanning the file once, building an index -> offset lookup table, and then using seek.
Some of these approaches assume you have some control over the files you're reading.
It also depends on whether you read often and write infrequently (if so, building an index up front is not such a bad idea).
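To sketch the roll-your-own version (the function names here are mine): one full scan records the byte offset where each line starts, and after that any line can be reached with a single seek instead of re-reading the file.

```python
def build_line_offsets(filename):
    """Scan the file once, recording the byte offset of each line start."""
    offsets = []
    with open(filename, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line(filename, offsets, lineno):
    """Jump straight to a line using seek; no rescanning."""
    with open(filename, 'rb') as f:
        f.seek(offsets[lineno])
        return f.readline().decode().rstrip('\n')
```

Opening in binary mode matters: seek offsets are byte positions, and text mode would complicate the arithmetic.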
Upvotes: 1
Reputation: 7821
If you only want a small portion of the file, I would use itertools.islice. It keeps nothing in memory but the lines you actually request.
Here's an example:
from itertools import islice

def yield_specific_lines_from_file(filename, start, stop):
    with open(filename, 'rb') as ifile:
        for line in islice(ifile, start, stop):
            yield line

lines = list(yield_specific_lines_from_file('test.txt', 10, 20))
If you use Python 3.3 or newer, you can also simplify this with the yield from statement:
from itertools import islice

def yield_specific_lines_from_file(filename, start, stop):
    with open(filename, 'rb') as ifile:
        yield from islice(ifile, start, stop)

lines = list(yield_specific_lines_from_file('test.txt', 10, 20))
This will not cache the lines you've already read from the file, though. If you want to do this, I suggest that you store all read lines in a dictionary with the line number as key, and only pull data from the file when needed.
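A minimal sketch of that caching idea (the helper name and cache structure are mine, not part of itertools): islice fills the dictionary on the first request, and later requests for the same lines never touch the disk.

```python
from itertools import islice

def make_cached_reader(filename):
    """Return a function that fetches lines by number, caching what it reads."""
    cache = {}  # line number -> raw line

    def get_lines(start, stop):
        missing = [n for n in range(start, stop) if n not in cache]
        if missing:
            # One islice pass fills in everything we don't have yet.
            with open(filename, 'rb') as f:
                for n, line in enumerate(islice(f, start, stop), start):
                    cache[n] = line
        return [cache[n] for n in range(start, stop) if n in cache]

    return get_lines
```

The closure keeps the cache private to one file; for many files you'd hold one reader per file, or key the cache on (filename, line number) instead.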
Upvotes: 2