Reputation: 23
This is kind of a question, but it's also kind of me just hoping I don't have to write a bunch of code to get behavior I want. (Plus, if it already exists, it probably runs faster than what I would write anyway.) I have a number of large lists of numbers that cannot fit into memory -- at least not all at the same time. That is fine, because I only need a small portion of each list at a time, and I know how to save the lists into files and read out the part of the list I need. The problem is that my method of doing this is somewhat inefficient, as it involves iterating through the file to reach the part I want. So, I was wondering if there happened to be some library out there that I'm not finding that allows me to index a file as though it were a list, using the [] notation I'm familiar with. Since I'm writing the files myself, I can make their formatting whatever I need, but currently my files contain nothing but the elements of the list, with \n as a delimiter between values.
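For reference, the inefficient read I describe looks roughly like this (a minimal sketch; read_range and numbers.txt are just placeholder names):

from itertools import islice

def read_range(path, start, stop):
    # Scans the file line by line; islice still has to consume the
    # first `start` lines, which is what makes this O(n) in the index.
    with open(path) as f:
        return [int(line) for line in islice(f, start, stop)]

chunk = read_range('numbers.txt', 1000, 1003)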
Just to recap what I'm looking for, and to make it more specific:

1) f[1:3] should return the corresponding elements of the list as a Python list object in memory
2) f[i] = x should write the value x to the file f in the location corresponding to index i

To be honest, I don't expect this to exist, but you never know when you've missed something in your research. So, I figured I'd ask. On a side note, if this doesn't exist, is it possible to overload the [] operator in Python?
Upvotes: 2
Views: 347
Reputation: 23186
If your data is purely numeric, you could consider using numpy arrays and storing the data in npy format. Once stored in this format, you could load the memory-mapped file as:
>>> X = np.load("some-file.npy", mmap_mode="r")
>>> X[1000:1003]
memmap([4, 5, 6])
This access will read directly from disk without loading the data that precedes the requested slice.
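To create such a file in the first place, and to write through the same mapping, something along these lines should work (np.save and the "r+" mmap mode are standard numpy; the file name and values are placeholders):

import numpy as np

# Write the whole list out once in npy format.
np.save('some-file.npy', np.arange(1_000_000))

# Reopen memory-mapped in read/write mode; slice assignments go to disk.
X = np.load('some-file.npy', mmap_mode='r+')
X[1000:1003] = [7, 8, 9]
X.flush()  # ensure the changes are written back to the file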
Upvotes: 1
Reputation: 19885
You can actually do this by writing a simple class, I think:
class FileWrapper:
    def __init__(self, path, **kwargs):
        # Open in binary mode: offsets are then plain byte offsets, and
        # text-mode files do not allow end-relative seeks anyway.
        self._file = open(path, 'r+b', **kwargs)

    def _do_single(self, where, s=None):
        if where >= 0:
            self._seek(where)
        else:
            # Negative index: seek relative to the end of the file.
            self._seek(where, 2)
        if s is None:
            return self._read(1)
        else:
            return self._write(s)

    def _do_slice_contiguous(self, start, end, s=None):
        if start is None:
            start = 0
        self._seek(start)
        if s is None:
            # A size of -1 means "read to the end of the file".
            return self._read(-1 if end is None else end - start)
        else:
            return self._write(s)

    def _do_slice(self, where, s=None):
        if s is None:
            result = []
            for index in where:
                self._seek(index)
                result.append(self._read(1))
            return result
        else:
            # Iterating over a bytes object yields ints, so re-wrap each one.
            for index, char in zip(where, s):
                self._seek(index)
                self._write(bytes([char]))
            return len(s)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._do_single(key)
        elif isinstance(key, slice):
            if self._is_contiguous(key):
                return self._do_slice_contiguous(key.start, key.stop)
            else:
                return self._do_slice(self._process_slice(key))
        else:
            raise ValueError('File indices must be ints or slices.')

    def __setitem__(self, key, value):
        if isinstance(key, int):
            return self._do_single(key, value)
        elif isinstance(key, slice):
            if self._is_contiguous(key):
                return self._do_slice_contiguous(key.start, key.stop, value)
            else:
                where = self._process_slice(key)
                if len(where) == len(value):
                    return self._do_slice(where, value)
                else:
                    raise ValueError('Length of slice not equal to length of string to be written.')
        else:
            raise ValueError('File indices must be ints or slices.')

    def __del__(self):
        self._file.close()

    def _is_contiguous(self, key):
        return key.step is None or key.step == 1

    def _process_slice(self, key):
        # Fill in defaults; a non-contiguous slice still needs a stop value.
        start = 0 if key.start is None else key.start
        step = 1 if key.step is None else key.step
        return range(start, key.stop, step)

    def _read(self, size=-1):
        return self._file.read(size)

    def _seek(self, offset, whence=0):
        return self._file.seek(offset, whence)

    def _write(self, s):
        return self._file.write(s)
I'm sure many optimisations could be made, since I rushed through this, but it was fun to write.
This does not answer the question in full, because it supports random access of characters, as opposed to lines, which sit at a higher level of abstraction and are more complicated to handle (since they can be variable length).
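For completeness, a quick usage sketch (assuming a file data.txt already exists; since the file is opened in binary mode, indices are byte offsets and written values must be bytes):

f = FileWrapper('data.txt')
print(f[0])       # first byte, e.g. b'h'
print(f[0:5])     # first five bytes as a bytes object
print(f[0:10:2])  # every other byte, as a list of 1-byte values
f[0] = b'H'       # overwrite the first byte in place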
Upvotes: 1