nikhil
nikhil

Reputation: 9363

Python Datastructure Suggestion

i am building a small search engine to search a collection of pdfs. From each pdf i extract a set of tokens and store it in database. I do not want to store duplicate tokens in database, instead i want to store count of each token in the database. Does python has any special datastructure that do not store duplicates but stores the counts of each token?

Upvotes: 2

Views: 144

Answers (4)

Jakob Bowyer
Jakob Bowyer

Reputation: 34688

You could always implement an object for every file, giving it a number of methods, like open and display and etc etc. You could then define __hash__ and __eq__ for the object, this would allow you to store items in a set, causing the duplicates to just update a single instance inside the set.

This is just another way of doing something by no means is it the best method.

Upvotes: 0

Björn Pollex
Björn Pollex

Reputation: 76778

Python >=2.7 has the Counter.

Upvotes: 5

Martin Thurau
Martin Thurau

Reputation: 7644

The collections package has defaultdict which can be used as a key-value storage with a counter:

>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
...     d[k] += 1
...
>>> d.items()
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]

Just so notice: This is not a databse, it's a pure in memory storage. You would have to save this data somehow!

Upvotes: 3

MartinStettner
MartinStettner

Reputation: 29164

I'd suggest to use a simple dictionary to store the count like

storage = {} # initialize
# ...
if !storage.has_key(token):
  storage[token] = 1
else:
  storage[token] += 1

EDIT

That said, if you're using Python 3 I'd follow Space_C0wb0y's suggestion to use the Counter class ...

Upvotes: 3

Related Questions