Shaokan
Shaokan

Reputation: 7684

python should I use generator for this case?

I have a list of almost 2k dictionaries inside it. And I am using the list several times. For instance:

c = myClass()
c.create(source) # where source is a text of approximately 50k chars
                 # this method creates the list that has approximately 2k dictionaries
item = c.get(15012) # now, this one loops thru the list to find an item
                    # whenever the condition is matched, the for loop is broken and the value is returned
item2 = c.prevItem(item) # this one also loops thru the list by reversing it and bringing the next item

Now, imagine this scenario where I have the use the same list over and over again. Since the list is large I'd like to use a generator but as far as I've understood, generators have to be recreated when they throw StopIteration. So basically, in this scenario, is it convenient to use a generator or is there a more efficient way in terms of speed?

Upvotes: 4

Views: 145

Answers (6)

Jochen Ritzel
Jochen Ritzel

Reputation: 107666

You should use a list because you can do one trivial optimization with it: Sort it by the attribute you're looking for (in .get) and do a binary search.

In a list of 2000 items the average number of comparisons goes down from 1000 to 10! Getting the previous (and next) item becomes trivial too.

See the bisect module for the bisection algorithm.

Upvotes: 0

Roshan Mathews
Roshan Mathews

Reputation: 5898

You can try this subclass of OrderedDict. My earlier submission was incorrect (mentioned at the bottom):

from collections import OrderedDict

class MyOrderedDict(OrderedDict):
    def index(self, key):
        if key not in self.keys():
            raise KeyError
        return list(d.keys()).index(key)
    def prev(self, key):
        idx = self.index(key) - 1
        if idx < 0:
            raise IndexError
        return list(d.keys())[idx]
    def next(self, key):
        _list = list(d.keys())
        idx = self.index(key)
        if idx > len(_list):
            raise IndexError
        return _list[idx+1]

# >>> d = MyOrderedDict(((3, 'Three'), (2, 'Two'), (4, 'Four'), (1, 'One')))
# >>> d.index(3)
# 0
# >>> d.index(2)
# 1
# >>> d.prev(2)
# 3
# >>> d.prev(3)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 9, in prev
# IndexError
# >>> d.next(4)
# 1
# >>> d.next(1)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 16, in next
# IndexError: list index out of range

Edit - as @agf commented below, this is incorrect.

You're looking for a fast way to retrieve an item from myClass, so you should use a dictionary. But at the same time you want the data to be in some sort of order, so that you can do a prevItem on it. Why don't you store your data in a collections.OrderedDict added in Python 2.7, 3.1. ref

Upvotes: 1

Profane
Profane

Reputation: 1128

A list of two thousand dictionaries is quite normal. A typical website admin has many such lists, I'd imagine. If you seldom have to deal with problems like this, you might be fine with an ad hoc solution-- it may be worth considering a dictionary of dictionaries too so you don't have to iterate through every key every time. But the more routine way to address this data structure, from what I gather, is to use a database. Each of your dictionaries can have some key (ideally the condition you're checking for in your loop). The database can be instructed to index the data by this key and if you look at the work it does to retrieve the dictionary you want, you may be surprised to find the answer is almost none-- it pretty much just cuts the deck to the card you requested, so to speak (though it does have to do some work to setup the index, which is something like a sort operation).

Python offers many great ways to map code to databases of all kinds. Check out the powerful, but complex sqlalchemy, the built-in std library sqlite3 module, or join me in experimenting with mongoengine and nosql databases. (There are many many more too of course, but you can easily find another post here with a general overview). Good luck.

Upvotes: 1

agf
agf

Reputation: 176910

If you frequently get an item at a known offset from a previously retrieved item, is to change .get to return not only the item, but it's position in the list. Then you could implement prevItem as:

def previtem(self, pos):
    return self.itemlist[pos - 1]

item, pos = c.get(itemnum)
item2 = c.prevItem(pos)

If, instead, you are doing some sort of operation on item to get a new itemnum, you should store them in a dict instead of a list. This way, get is just a dictionary lookup (much faster than list search):

def get(self, itemnum):
    return self.premade_dict[itemnum]

So one way or the other you should be able to replace some searches with cheaper operations.

Upvotes: 3

J&#252;rgen Strobel
J&#252;rgen Strobel

Reputation: 2248

Depends how you want to use a generator. Generators are good at only executing code when it is really needed. Seems your for loop with break already does this.

You could change your class interface though.

def getItems(cond):
    # find item, remember index
    yield item
    # find previous item, possibly much more efficient with the index
    yield previtem

Now upon calling getItems(), you can walk the returned generator for 1 or 2 items and only as much code as needed will be executed.

Upvotes: 1

Owen
Owen

Reputation: 39366

It sounds to me like you have to decide which you'd rather do:

1) Save the values so you don't have to recalculate them, but use more space to do so.

2) Recalculate them each time, but save on space because you don't have to store them.

If you think about it, no matter what kind of generator/list/whatever you're using, one of those two things has to happen. And I don't think there's a simple hard rule to say which is better. (Personally I'd say pick one and don't look back. You have your whole life ahead of you.)

Upvotes: 5

Related Questions