thuyein
thuyein

Reputation: 1802

Python dict.get() Lock

When I use dictionary.get() function, is it locking the whole dictionary? I am developing a multiprocess and multithreading program. The dictionary is used to act as a state table to keep track of data. I have to impose a size limit to the dictionary, so whenever the limit is being hit, I have to do garbage collection on the table, based on the timestamp. The current implementation will delay adding operation while garbage collection is iterating through the whole table.

I will have 2 or more threads, one just to add data and one just to do garbage collection. Performance is critical in my program to handle streaming data. My program is receiving streaming data, and whenever it receives a message, it has to look for it in the state table, then add the record if it's non-existent in the first place, or copy certain information and then send it along the pipe.

I have thought of using multiprocessing to do the search and adding operation concurrently, but if I used processes, I have to make a copy of state table for each process, in that case, the performance overhead for synchronization is too high. And I also read that multiprocessing.manager.dict() is also locking the access for each CRUD operation. I could not spare the overhead for it so my current approach is using threading.

So my question is while one thread is doing .get(), del dict['key'] operation on the table, will the other insertion thread be blocked from accessing it?

Note: I have read through most SO's python dictionary related posts, but I cannot seem to find the answer. Most people only answer that even though python dictionary operations are atomic, it is safer to use a Lock for insertion/update. I'm handling a huge amount of streaming data so Locking every time is not ideal for me. Please advise if there is a better approach.

Upvotes: 2

Views: 3635

Answers (2)

ShadowRanger
ShadowRanger

Reputation: 155684

If the process of hashing or comparing the keys in your dictionary could invoke arbitrary Python code (basically, if the keys aren't all Python built-in types implemented in C, e.g. str, int, float, etc.), then yes, it would be possible for a race condition to occur in which the GIL is released while a bucket collision is being resolved (during the equality test), and another thread could leap in and cause the object being compared to to disappear from the dict. They try to ensure it doesn't actually crash the interpreter, but it has been a source of errors in the past.

If that's a possibility (or you're on a non-CPython interpreter, where there is no GIL providing basic guarantees like this), then you should really use a lock to coordinate access. On CPython, as long as you're on modern Python 3, the cost will be fairly low; contention on the lock should be fairly low since the GIL ensures only one thread is actually running at once; most of the time your lock should be uncontended (because the contention is on the GIL), so the incremental cost of using it should be fairly small.

A note: You might consider using collections.OrderedDict to simplify the process of limiting the size of your table. With OrderedDict, you can implement the size limit as a strict LRU (least-recently used) system by making additions to the table done as:

with lock:
    try:
        try:
            odict.move_to_end(key) # If key already existed, make sure it's "renewed"
        finally:
            odict[key] = value  # set new value whether or not key already existed
    except KeyError:
        # move_to_end raising key error means newly added key, so we might
        # have grown larger than limit
        if len(odict) > maxsize:
            odict.popitem(False)  # Pops oldest item

and usage done as:

with lock:
    # move_to_end optional; if using key means it should live longer, then do it
    # if only setting key should refresh it, omit move_to_end
    odict.move_to_end(key)
    return odict[key]

This does need a lock, but it also reduces the work for garbage collection when it grows too large from "check every key" (O(n) work) to "pop the oldest item off without looking at anything else" (O(1) work).

Upvotes: 4

user8624339
user8624339

Reputation:

A lock is used to avoid race conditions so no two threads could make change to the dict at the same time so it is advisible that you use the lock else you might go into a race condition causing program to fail. A mutex lock can be used to deal with 2 threads.

Upvotes: 0

Related Questions