AlwaysLearning

Reputation: 8011

Mapping a file to memory

Consider a file whose content is a pickled Python object. To make things concrete, suppose that this object is simply a list of strings; in reality it is more complex.

My Python script reads that file on start-up. After that, certain events trigger minor updates to the loaded object (in the example, a string gets added or removed anywhere in the list). To keep the file up to date, the script pickles the object and writes it back out, i.e. the whole file is re-written from scratch every time.

Is there a way to avoid re-writing the whole object every time? I am open to other ways of storing the data, i.e. not relying on pickling. I suppose what I am looking for is some way of mapping the file to memory, so that any update to the object in memory results in an instant update to the file.
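
Roughly, the current pattern is the following (the filename and function name here are just illustrative):

import pickle

STATE_FILE = 'state.pkl'

# read the pickled list on start-up
with open(STATE_FILE, 'rb') as f:
    strings = pickle.load(f)

def on_event(new_string):
    strings.append(new_string)          # minor update to the object in memory...
    with open(STATE_FILE, 'wb') as f:   # ...but the whole file is re-written
        pickle.dump(strings, f)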

Upvotes: 2

Views: 304

Answers (1)

Aaron

Reputation: 11075

update tl;dr

  • yes, only strings work as keys
  • for each key-value pair, any change to the value requires the entire value to be re-written to disk
    • the more keys you use and the smaller each value is, the smaller your disk writes will be (see the sketch right after this list)
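
A minimal sketch of what that means in practice (the filename and key names are illustrative): storing each string under its own key means an update re-pickles only that one small value, while keeping everything under a single key re-writes the whole list on every change.

import shelve

with shelve.open('strings_db') as d:
    # one key per string: an edit re-writes only that entry
    d['item-0'] = 'first string'
    d['item-1'] = 'second string'
    d['item-1'] = 'second string (edited)'  # only this value is re-pickled

    # everything under one key: any change re-writes the whole list
    d['all'] = ['first string', 'second string']
    temp = d['all']
    temp.append('third string')
    d['all'] = temp                         # the entire list is re-pickled to disk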

If you are already using pickles, I would recommend Python's standard shelve library. It lets you set up a dbm-style database and read/write any pickleable object.

from the docs:

import shelve

d = shelve.open(filename) # open -- file may get suffix added by low-level
                          # library

d[key] = data   # store data at key (overwrites old data if
                # using an existing key)
data = d[key]   # retrieve a COPY of data at key (raise KeyError if no
                # such key)
del d[key]      # delete data stored at key (raises KeyError
                # if no such key)
flag = key in d          # true if the key exists
klist = list(d.keys())   # a list of all existing keys (slow!)

# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2, 3]  # this works as expected, but...
d['xx'].append(5)       # *this doesn't!* -- d['xx'] is STILL [0, 1, 2, 3]!

# having opened d without writeback=True, you need to code carefully:
temp = d['xx']      # extracts the copy
temp.append(5)      # mutates the copy
d['xx'] = temp      # stores the copy right back, to persist it

# or, d=shelve.open(filename,writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.

d.close()       # close it

As per the comments, let me explain a few things..

Strings as keys:

Sorta like it says on the tin, you can only index key-value pairs with a string. This is unlike a normal dictionary, which can take any hashable type as a key. For example:

import shelve
database = shelve.open(filename) #open our database
dictionary = {} #create an empty dict

ingredients = ['flour', 'sugar', 'butter', 'water'] #recipe for something tasty

database['pie dough'] = ingredients #works because 'pie dough' is a string
dictionary['pie dough'] = ingredients #works because strings are hashable

recipe = 'pie dough'
database[recipe] = ingredients #you can use a variable as long as it is a str

#say we want to just number our entries..
#database[5] = ingredients #won't work -- raises an error because 5 is an integer, not a string
dictionary[5] = ingredients #works because integers are hashable

database.close()

Mutable objects:

The statement in the docs about mutable objects uses some technical wording, but it basically comes down to where your data lives: in RAM versus on your hard drive. If one of your objects is a list (a mutable object), you can retrieve it like so:

mylist = database['pie dough']
#or
mylist = dictionary['pie dough']

We have since realized that we need to add salt to this recipe, and since our object is a list, we can relatively easily call:

mylist.append('salt')

We can then write this updated recipe back to the database or dictionary using simple assignment:

database['pie dough'] = mylist
#or
dictionary['pie dough'] = mylist

This method will always work, for both the database and the dictionary. The difference comes when you want to reference objects directly. With the dictionary, it is superfluous to create mylist as a temporary variable when we can refer to the list contained within the dictionary itself:

dictionary['pie dough'].append('salt')

This is a direct result of the fact that the values in a dictionary are references to the original objects in RAM: when you modify dictionary['pie dough'], you are also modifying mylist and ingredients, because they are all the same object. With the database, when you load an object (mylist = database['pie dough']) you are making a copy of what is on the hard drive. You can modify that copy all you want, but it won't change what's on the hard drive, because it's not the same object (it's a copy).
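
A quick way to see this for yourself (the filename is illustrative): without writeback=True, mutating the value you just read back does not touch the file.

import shelve

d = shelve.open('recipes')                       # no writeback
d['pie dough'] = ['flour', 'sugar', 'butter', 'water']

d['pie dough'].append('salt')                    # mutates a temporary copy only
print(d['pie dough'])                            # still no salt

temp = d['pie dough']                            # extract the copy
temp.append('salt')                              # mutate the copy
d['pie dough'] = temp                            # assign it back to persist it
print(d['pie dough'])                            # now ends with 'salt'
d.close()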

The workaround for this is to open the database with the keyword argument writeback=True. This gives you somewhat similar functionality, but with some caveats... While it is now possible to call append(...) directly on the shelf's entries without making a copy yourself, the changes won't be written out until you call d.sync() or d.close(). I'll start with a fresh example:

import shelve

d = shelve.open(filename)
d['dough'] = ['flour', 'sugar', 'butter', 'water']
d.close() #write initial recipe to our database

#some time later we want to update the recipe
d = shelve.open(filename, writeback=True)
d['dough'].append('salt')
d.close()

Now any time you access part of the database, that part is loaded into RAM, and updates can be made directly to the object. The big caveat is that the updates are not written to the hard drive until d.sync() or d.close() is called. It is common for databases to be larger than all the available RAM in a system, so if you update too many objects without calling d.sync() to write pending changes to the hard drive (or without periodically closing and re-opening the database), you can run out of memory.
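
For example, a long-running update loop might flush the writeback cache every so often instead of only at close (the filename, key names and interval are illustrative):

import shelve

d = shelve.open('big_db', writeback=True)
for i in range(10_000):
    d[f'key-{i}'] = list(range(10))   # cached in RAM until synced
    if i % 1000 == 999:
        d.sync()                      # write cached entries to disk and empty the cache
d.close()                             # writes whatever is still cached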

Upvotes: 1
