Reputation: 54734
I thought I understood the relationship between set
and frozenset
in Python, but something magic is happening with set membership (set1 in set2
) and I don't see how it can work.
A set
of frozenset
works:
>>> s = set()
>>> s.add(frozenset(['hello', 'world']))
>>> frozenset(['hello', 'world']) in s
True
I can't add unhashable types, such as list
, to my set
, and I can't use the in
operator with unhashable types either:
>>> s.add(['hello', 'world'])
TypeError: unhashable type: 'list'
>>> ['hello', 'world'] in s
TypeError: unhashable type: 'list'
Likewise I can't add a set
to my set:
>>> s.add({'hello', 'world'})
TypeError: unhashable type: 'set'
...but I can use in
with a set
to check whether the corresponding frozenset
is a member:
>>> {'hello', 'world'} in s
True
...and it gives the right result:
>>> {'jello', 'world'} in s
False
Why is set1 in set2
special? Is it really calculating the hash of my set before testing membership? Or is it falling back on brute force?
Edit: found it, the C implementation of set.__contains__
knows how to call call key = frozenset(key)
whenever hash(key)
throws TypeError
and isinstance(key, set)
.
https://github.com/python/cpython/blob/6c7d67c/Objects/setobject.c#L1890-L1897
rv = set_contains_key(so, key);
if (rv < 0) {
if (!PySet_Check(key) || !PyErr_ExceptionMatches(PyExc_TypeError))
return -1;
PyErr_Clear();
tmpkey = make_new_set(&PyFrozenSet_Type, key);
if (tmpkey == NULL)
return -1;
rv = set_contains_key(so, tmpkey);
Upvotes: 4
Views: 78
Reputation: 59711
Looking at the source code, the current implementation (3.7) of set_contains
reveals that indeed set objects are converted to frozen sets when checking for membership in a set:
static int
set_contains(PySetObject *so, PyObject *key)
{
PyObject *tmpkey;
int rv;
rv = set_contains_key(so, key);
if (rv < 0) {
if (!PySet_Check(key) || !PyErr_ExceptionMatches(PyExc_TypeError))
return -1;
PyErr_Clear();
tmpkey = make_new_set(&PyFrozenSet_Type, key);
if (tmpkey == NULL)
return -1;
rv = set_contains_key(so, tmpkey);
Py_DECREF(tmpkey);
}
return rv;
}
Basically, if the given object is a set (PySet_Check(key)
), then a new frozen set is created (make_new_set(&PyFrozenSet_Type, key)
) and membership is checked again (set_contains_key(so, tmpkey)
). I do not think this is actually documented anywhere, and it is probably meant to be a feature that "just works" without you even noticing if you do not think too hard about it.
It does seem to be something in the same spirit as the rejected PEP-0351, although it is not based on any special method and it is specific to sets only.
EDIT: For more details, this feature was apparently first implemented back in 2003 (commit) by Raymond Hettinger at the request of Alex Martelli. So if one of them happens to be around maybe they could give more background on this.
EDIT 2: Worth noting, this can have a significant performance impact under particular circumstances:
s = set(range(100000))
sf = frozenset(s)
t = { sf }
%timeit sf in t # True
>>> 31.6 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit s in t # True
>>> 4.9 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The test is five orders of magnitude slower in the second case!
EDIT 3: I raised an issue for the Python team to discuss the possibility of deprecating this behavior, since it seems inconsistent and potentially problematic. The developer considered it was still a useful feature and the cons were not worth breaking compatibility.
Upvotes: 4