Reputation: 465
I'm creating a recommendation engine for work and ended up with an 8,000 by 8,000 item-item similarity matrix. The matrix is pretty sparse, so I set out to build a dictionary with many keys, where each key maps to a sorted list of product recommendations (in the form of tuples). I got this to work, see below.
In [191]: dictionary["15454-M6-ECU2="]
Out[191]:
[('15454-M-TSCE-K9=', 0.8),
('15454-M2-AC=', 0.52),
('15454-M6-DC-RF', 0.45),
('15454-M6-ECU2=', 0.63)]
However, I now have a problem in interpreting the result:
In [204]: sys.getsizeof(dictionary)
Out[204]: 786712
In [205]: sys.getsizeof(similarity_matrix)
Out[205]: 69168
I eliminated a ton of zeros (each of which was being represented with 32 or 64 bits), yet the dictionary is an order of magnitude larger than the matrix. Why did the object size increase when I removed the sparsity from the matrix?
Upvotes: 0
Views: 94
Reputation: 40713
Judging by the size of your dictionary, it seems that you have one key/value pair for every possible key, even for keys that have no similar items at all.
I imagine your code looks something like this:
# initialise sparse dict with one empty list of similar nodes for each node
sparse_dict = dict((key, []) for key in range(1000))
sparse_dict[0].append((2, 0.5))  # 0 is similar to 2 by 50%

def get_similarity(d, x, y):
    for key, value in d[x]:
        if key == y:
            return value
    return 0

assert get_similarity(sparse_dict, 0, 1) == 0
assert get_similarity(sparse_dict, 0, 2) == 0.5
However, using the get method of a dict you can implement even sparser dictionaries:
# initialise empty mapping -- literally an empty dict
very_sparse_dict = {}
very_sparse_dict[0] = [(2, 0.5)]  # 0 is similar to 2 by 50%

def get_similarity2(d, x, y):
    for key, value in d.get(x, ()):
        if key == y:
            return value
    return 0

# 0 not linked to 1, so 0% similarity
assert get_similarity2(very_sparse_dict, 0, 1) == 0
# 0 and 2 are similar
assert get_similarity2(very_sparse_dict, 0, 2) == 0.5
# 1 not similar to anything as it is not even present in the dict
assert get_similarity2(very_sparse_dict, 1, 2) == 0
And the size of each dict is:
>>> print("sparse_dict:", sys.getsizeof(sparse_dict))
sparse_dict: 49248
>>> print("very_sparse_dict", sys.getsizeof(very_sparse_dict))
very_sparse_dict: 288
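For completeness, here is one way such a very sparse dict could be built from an existing dense similarity matrix. This is only a sketch: the 4-by-4 matrix is made up for illustration, and it assumes "sparse" simply means "drop exact zeros". Rows with no non-zero entries never get a key, which is what keeps the mapping small.

```python
# Hypothetical 4x4 dense similarity matrix: dense[i][j] is the
# similarity between items i and j (values invented for this example).
dense = [
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.9, 0.0],
]

# Keep only non-zero entries; rows with no similarities get no key at all.
very_sparse_dict = {}
for i, row in enumerate(dense):
    entries = [(j, s) for j, s in enumerate(row) if s != 0]
    if entries:
        very_sparse_dict[i] = entries

print(very_sparse_dict)
# {0: [(2, 0.5)], 2: [(0, 0.5), (3, 0.9)], 3: [(2, 0.9)]}
```

Note that item 1 is absent entirely, so get_similarity2 would report 0 for any lookup involving it without the dict paying for an empty list.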
Upvotes: 0
Reputation: 77347
sys.getsizeof only returns the size of the container, not the container plus the size of the items inside it. A dict reports the same size regardless of how large its contained values are, and it's still only about 98 bytes per key/value pair: it stores a reference to the key and a reference to the value, plus other overhead for the hash buckets.
>>> sys.getsizeof(dict((i,'a'*10000) for i in range(8000)))
786712
>>> sys.getsizeof(dict((i,'a'*1) for i in range(8000)))
786712
>>> 786712/8000
98
A tuple is much smaller, storing only the references themselves (about 8 bytes each here).
>>> sys.getsizeof(tuple((i,'a'*10000) for i in range(8000)))
64056
>>> sys.getsizeof(tuple((i,'a'*1) for i in range(8000)))
64056
>>> 64056/8000
8
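So to compare the two representations fairly, the contained lists and tuples have to be counted too. Below is a minimal sketch of a recursive sizer (deep_getsizeof is a made-up helper, not a stdlib function; it only handles the container types used here and ignores custom objects):

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Rough total size of an object graph: the container plus
    everything it references. Handles dicts, lists, tuples and sets
    only, and deduplicates shared objects by id so nothing is
    counted twice."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

small = {0: [(2, 0.5)]}
print(sys.getsizeof(small))   # container only
print(deep_getsizeof(small))  # container plus the list, tuple, int and float
```

With a measurement like this, the dict-of-lists and the original matrix can be compared on equal footing instead of container-to-container.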
Upvotes: 1