Reputation: 9363
i am building a small search engine to search a collection of pdfs. From each pdf i extract a set of tokens and store it in database. I do not want to store duplicate tokens in database, instead i want to store count of each token in the database. Does python has any special datastructure that do not store duplicates but stores the counts of each token?
Upvotes: 2
Views: 144
Reputation: 34688
You could always implement an object for every file, giving it a number of methods, like open and display and etc etc. You could then define __hash__
and __eq__
for the object, this would allow you to store items in a set, causing the duplicates to just update a single instance inside the set.
This is just another way of doing something by no means is it the best method.
Upvotes: 0
Reputation: 7644
The collections package has defaultdict which can be used as a key-value storage with a counter:
>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
... d[k] += 1
...
>>> d.items()
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]
Just so notice: This is not a databse, it's a pure in memory storage. You would have to save this data somehow!
Upvotes: 3
Reputation: 29164
I'd suggest to use a simple dictionary to store the count like
storage = {} # initialize
# ...
if !storage.has_key(token):
storage[token] = 1
else:
storage[token] += 1
EDIT
That said, if you're using Python 3 I'd follow Space_C0wb0y's suggestion to use the Counter
class ...
Upvotes: 3