Reputation:

Hashing a generator expression

Apparently python will allow me to hash a generator expression like (i for i in [1, 2, 3, 4, 5])

>>> hash(i for i in [1, 2, 3, 4, 5])
8735741846615

On closer inspection however, this hash value is always the same no matter what generator I put into it!

>>> hash(i for i in range(2))
8735741846615
>>> hash(i for i in [1, 2, 3, 4, 5, 6])
8735741846615
>>> hash(i for i in [0, 1, 2, 3, 4, 5, 6])
8735741846615

why is this happening? Why is hashing a generator even allowed?

I need to do this because I’m storing stuff in a dictionary cache. You access entries in the cache by passing lists of objects. But certain groups of lists with different items should still point to the same entry, since the only thing that matters here is the integer value of one attribute in each list item. So they should hash only based on those integer values, not anything relating to the items themselves, to avoid unnecessary cache misses.

I'm aware you can always just convert the expression to a tuple, but I was asking if you can bypass that by simply using the output of the generator without the tuple container, similar to how sum() could be used for such a thing.

Upvotes: 9

Answers (4)

Moses Koledoye

Reputation: 78554

generator expression

An expression that returns an iterator

If iterators are hashable, generator expressions should not be an exception. Iterators are stateful, not sure about mutable. Or would stateful mean mutable? More so, iterators are simply object/value factories and nothing more:

>>> l = [1, 2]
>>> it = iter(l)
>>> l[:] = [4, 5]
>>> next(it)
4

The iterator is not bound to the initial values of the list l. So your generator expressions are not hashed relative to the values in range(2) or [1, 2, 3, 4, 5, 6].

Hashing to the same value is apparently interpreter dependent, and is likely so since you don't have any reference to the generator expressions. I have different results for each:

>>> hash(i for i in range(2))
-2143682330
>>> hash(i for i in range(2))
-2143687189
>>> hash(i for i in [1, 2, 3, 4, 5, 6])
-2143679383

If you named the generator expressions, you would probably get different values:

>>> ge1 = (i for i in range(2))
>>> ge2 = (i for i in [1, 2, 3, 4, 5, 6])
>>> hash(ge1)
3801919
>>> hash(ge2)
-2143681547

Upvotes: 0

user2864740

Reputation: 61935

Python object allows hash and equality: hash(object()).

The 'unhashtable type' is a purposeful restriction on core mutable collections (ie. list) with a defined content-equality comparison because this avoids some common programming errors. Most notably, when used as dictionary keys.

However, the content-equality compare for a generator object is the same as the identity compare: gen == gen implies gen is gen. This is no different than using hash/== with the result of object(): it often isn't useful, and if it is, identity-hash/equality are expected.

The result of either the hash or the equality of a generator is independent upon the future materialized content. Because of this it has no content-useful == and thus needs not implement a content-useful hash. If it should be a 'unhashtable type' then becomes a gray-area at best because it is no more (or rather, is as) useful than object() as a dictionary key.

Thus it is permissible for any value to be returned for the hash of a generator (even if it is duplicated among different generator instances as long as it meets the constraint hash(x) == hash(x) -> x == x), as long as it is consistent for the given generator instance (as gen == gen -> gen is gen).

As an implementation detail, I would expect the hash of a generator to use either the same (as in, not overloaded) or similar implementation to that of a base object.

Upvotes: 0

BrenBarn

Reputation: 251428

The hash value is presumably based on the object identity, not its contents. You're probab;y seeing that result because you're not storing the generators, so they're garbage collected and their ids are re-used. Here are some other examples that show it's not just a single hash all the time:

>>> x = (i for i in [0, 1, 2, 3, 4, 5, 6])
>>> y = (i for i in [0, 1, 2, 3, 4, 5, 6])
>>> hash(x)
-9223372036852064692
>>> hash(y)
-9223372036852064683
>>> id(x)
43377864
>>> x = (i for i in [0, 1, 2, 3, 4, 5, 6])
>>> id(x)
43378296
>>> hash(x)
-9223372036852064665

As for why Python lets you hash them, it's the same reason it lets you hash all sorts of objects: so you can tell different objects apart and use them as dictionary keys. It's not meant to tell you about what the generator does, just what object it is. It's no different than this:

>>> hash(object())
225805
>>> hash(object())
225805
>>> x = object()
>>> y = object()
>>> hash(x)
225805
>>> hash(y)
225806

The idea that if an object is hashable then it is immutable is something of a misconception. Objects can define any sort of hashing they want as long as it is compatible with their own equality definition. Instances of user-defined classes, for instance, are hashable in a manner similar to these generator functions (hash based on object id).

People sometimes seem to think that the rule is "if an object is mutable, it is not hashable", but that's not correct. The actual rule is more like "if an object is not hashable, then it is mutable". (More specifically, it defines a notion of equality that depends on mutable state, or it explicitly bars hashing for some perverse reason of its own.) In other words, it doesn't usually make sense to ask why a certain hashable type is hashable; it only makes sense to ask why an unhashable type is unhashable. The built-in mutable types list, dict, and set have a specific reason for being unhashable: they are meant to be compared based on their (mutable) values, not object id. But just because some other type has some internal state doesn't mean it can't be hashable. You can define objects that are mutable and hashable all day long. They just have to define equality and hashability in compatible ways. Generators indeed cannot be compared for equality based on their contents either:

>>> x = (i for i in [0, 1, 2, 3, 4, 5, 6])
>>> y = (i for i in [0, 1, 2, 3, 4, 5, 6])
>>> x == y
False

For generators, hashing based on values would be uniquely nonsensical, because the values don't exist until they're actually yielded from the generator. If hashing were based on values, there would be no way to hash a generator that, for instance, used random numbers to decide what to yield next.

Upvotes: 3

MisterMiyagi

Reputation: 51979

So there are two questions here actually:

Why do generators have a hash, when e.g. lists do not?
Why do you always get the same hash?

For 2, the answer is simple: In this case, hash is based on the object's id. Since you don't actually store the object, its memory gets reused. That means the next generator has the same id and thus hash.

For 1, the answer is "because they can". hash is primarily meant for use in dict, set and other situations where it allows identifying an object. These situations set the constraint that a == b also implies hash(a) == hash(b) (the reverse is not constrained).

Now, for list, dict and other collections, equality is based on content. [1,2,3] == [1,2,3] regardless whether both are the same objects, for example. This means if something is added to them, their equality changes and thus their hash would change as well. Thus, hash is undefined, as it must be a constant for it to work in dict etc.

In contrast, a generator can have any content. Consider for example a generator providing random values. Thus, it makes no sense to compare generators by content. They are only ever compared by identity. So, a == b equals id(a) == id(b) for generators. In turn, this means basing hash(a) on id(a) will always satisfy the constraint by equality on hash.

Upvotes: 6

Hashing a generator expression

Answers (4)

Related Questions