Reputation: 63
I'm having some trouble quickly filling a numba Dict object with key-value pairs (around 63 million of them). Is there an efficient way to do this in parallel?
The documentation (https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#typed-dict) is clear that numba.typed.Dict is not thread-safe, so using prange with a single Dict object would be a bad idea. I've tried using a numba List of Dicts, populating them in parallel and then stitching them together using update, but I think this last step is also inefficient (see the sketch just below).
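This is roughly what that attempt looked like (a simplified sketch: the shard count, element types, and index striding here are illustrative, not my exact code):

from numba import njit, prange, typed, types

@njit(parallel=True)
def fill_shards(keys_list, values_list, shards):
    # each thread writes only to its own Dict, so nothing is shared
    num_shards = len(shards)
    for s in prange(num_shards):
        for i in range(s, len(keys_list), num_shards):
            shards[s][keys_list[i]] = values_list[i]

# the slow part: stitching everything back into a single Dict
def merge_shards(shards, d):
    for shard in shards:
        d.update(shard)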
One thing that may be important: all the keys are unique, i.e. once assigned, a key will never be reassigned a value. I think this property makes the problem amenable to an efficient parallelised solution.
Below is an example of the serial approach, which is slow with a large number of key-value pairs.
from numba import njit, typed, types

d = typed.Dict.empty(
    key_type=types.UnicodeCharSeq(128), value_type=types.int64
)

@njit
def fill_dict(keys_list, values_list, d):
    # one serial pass over all 63M pairs
    n = len(keys_list)
    for i in range(n):
        d[keys_list[i]] = values_list[i]

fill_dict(keys_list, values_list, d)
Can anybody help me?
Many thanks.
Upvotes: 1
Views: 438
Reputation: 11920
You don't have to stitch them together if you preprocess each key into an integer hash and take it modulo num_shard to select a shard.
# assuming hash() returns an arbitrary integer computed from the key's characters

# lookup
shard = hash(key) % num_shard
selected_dictionary = dictionary[shard]
value = selected_dictionary[key]

# inserting
# lock only the selected_dictionary, never the whole table
shard = hash(key) % num_shard
selected_dictionary = dictionary[shard]
selected_dictionary[key] = value
The hash could be something as simple as the sum of the ASCII codes of the characters in the key. The modulo-based indexing partitions the keys into disjoint shards, so the shards can be filled independently, with no extra processing beyond the hashing; a numba sketch of the whole scheme is below.
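Here is a minimal numba sketch of that scheme. It is illustrative rather than tested: key_hash, fill_sharded and lookup are hypothetical names, plain unicode keys stand in for your UnicodeCharSeq(128), and each shard's thread scans the full key list and keeps only the keys that hash to it, trading redundant hashing for lock-free writes.

from numba import njit, prange, typed, types

@njit
def key_hash(key):
    # toy hash: sum of the character codes of the key
    h = 0
    for c in key:
        h += ord(c)
    return h

@njit(parallel=True)
def fill_sharded(keys_list, values_list, shards):
    num_shard = len(shards)
    for s in prange(num_shard):
        # each thread writes only to shards[s], so no Dict is shared
        for i in range(len(keys_list)):
            if key_hash(keys_list[i]) % num_shard == s:
                shards[s][keys_list[i]] = values_list[i]

@njit
def lookup(key, shards):
    # reads route through the same hash, so there is no merge step at all
    return shards[key_hash(key) % len(shards)][key]

num_shard = 8  # e.g. one shard per core
shards = typed.List()
for _ in range(num_shard):
    shards.append(typed.Dict.empty(types.unicode_type, types.int64))

Since all the keys are unique, no two writes ever target the same slot; the price is that every thread hashes every key, which should be cheap next to the Dict insertions themselves.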
Upvotes: 1