andy

Reputation: 4281

Why does a Python dict consume more memory when storing more than 550k keys?

I use Python's dict type to store a data file with more than 550k keys; the file is almost 29M on disk. However, after reading the data file, the memory used is more than 70M, which seems abnormal.

So, how does this happen?

Below is the function to read the data file.

def _update_internal_metrics(self, signum, _):
    """Read the dumped metrics file"""
    logger.relayindex('reload dumped file begins')
    dumped_metrics_file_path = os.path.join(settings.DATA_DIR,
                                            settings.DUMPED_METRICS_FILE)
    epoch = int(time.time())
    try:
        new_metrics = {}
        with open(dumped_metrics_file_path) as dumped_metrics_file:
            for line in dumped_metrics_file:
                line = line.strip()
                new_metrics[line] = epoch
    except Exception:
        if not signum:
            self._reload_dumped_file()
        logger.relayindex("Dumped metrics file does not exist or can"
                          "not be read. No update")
    else:
        settings["metrics"] = new_metrics

    instrumentation.increment('dumped.Reload')
    logger.relayindex('reload dumped file ends')

Upvotes: 2

Views: 76

Answers (1)

Karoly Horvath

Reputation: 96258

First of all, top isn't the right way to check this, as it tells you the memory consumption of the whole process. You can use getsizeof from the sys module:

sys.getsizeof(new_metrics)
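Note that getsizeof is shallow: for a dict it reports only the dict object and its slot table, not the key and value objects it references. A rough sketch for a fuller estimate (the helper name is just for illustration):

import sys

def rough_dict_size(d):
    """Shallow dict size plus the sizes of its keys and values.

    getsizeof() does not follow references, so the container and its
    contents are summed separately here; objects shared between entries
    (such as a single epoch int reused as every value) are counted once
    per entry, so treat the result as a rough upper estimate.
    """
    total = sys.getsizeof(d)
    for key, value in d.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total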

Second, there is some overhead associated with both strings and hash tables:

sys.getsizeof('')

On my system this is 24 bytes of overhead, and that overhead is constant regardless of the string's length. With 550k keys that's about 13M of overhead.
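You can verify the constant per-string overhead on your own interpreter; the exact constant differs between Python versions and 32- vs 64-bit builds, but it does not grow with the string:

import sys

payload = 'x' * 100
overhead = sys.getsizeof(payload) - len(payload)   # e.g. 24 on the system above
print(overhead == sys.getsizeof(''))               # True: the overhead is constant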

Python tries to keep its hash tables from getting too dense, as that would kill the lookup time. AFAIK the CPython implementation uses a 2x growth factor, with 2^k table sizes. As your key count is just above a power of two (math.log(550000, 2) # 19.06), the table is relatively sparse, with 2 ** 20 = 1048576 slots. On your 64-bit system, with an 8-byte object pointer per string, that's an additional 8M of overhead. You also store integers, which weren't in the original file (another 8M), and each hash table slot also stores the cached hash value (another 8M). See the source of PyDictEntry.
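Following that reasoning, a quick back-of-the-envelope check of the table cost (this assumes the classic open-addressing dict layout with one hash, one key pointer and one value pointer per slot, as in the PyDictEntry struct mentioned above):

import math

keys = 550000
print(math.log(keys, 2))                  # ~19.07, just past 2**19

slots = 2 ** 20                           # next power of two the table grows to
entry_bytes = 3 * 8                       # hash + key pointer + value pointer, 64-bit
print(slots * entry_bytes / 2.0 ** 20)    # ~24 (MB) for the table: roughly 8M per field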

That's 66M total, and of course you need some space for the rest of your Python app. It all looks fine to me.
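For reference, the tally of the round figures above, in MB:

raw_key_data    = 29   # the characters read from the file
string_overhead = 13   # ~24 bytes of per-string bookkeeping for 550k keys
key_pointers    = 8    # 2**20 slots * 8-byte key pointer
value_pointers  = 8    # 2**20 slots * 8-byte value pointer (the epoch ints)
stored_hashes   = 8    # 2**20 slots * 8-byte cached hash value

print(raw_key_data + string_overhead + key_pointers
      + value_pointers + stored_hashes)   # 66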

Upvotes: 1
