Reputation: 83
I am working on a search engine implementation in Python, and would ideally like to store as much of an inverted index into memory.
The current data structure of the index is as follows:
{term: [doc_frequency, {doc_number: [doc_positions]}]}
, where term
is a string and doc_frequency
, doc_number
and doc_positions
are integers. The size of this blows up really quickly for multiple documents.
As the document numbers and positions are frequently recurring integers for different terms, I was thinking if there is a way to leverage this characteristic by not storing a new integer every time but just referencing to the same integer multiple times?
I'm not an expert in data structures or Python at all, so forgive me if this is a stupid question. If there are any other suggestions than my specific question that could improve the memory usage, those are more than welcome as well.
Upvotes: 2
Views: 90
Reputation: 60
I am afraid that storing a pointer (reference) to an integer instead of an integer will not help as the size of the pointer is most probably 64 bits on your system (might be 32 if you are using 32bit python).
Python uses variable sizes of integers but on average it should be less or equal than 64 bits per integer. If you are most frequently use small numbers, see: https://www.pythontutorial.net/advanced-python/python-integers/.
If you can work with some limited range of integer, you can try to use numpy with dtype specified see: https://numpy.org/devdocs/user/basics.types.html. However you will probably need to change the structure of your storage.
Upvotes: 1