Reputation: 8530
What does dict(mapping) actually do?
Background:
Python's docs suggest there are three possible paths when constructing a dict, one of which is with a Mapping.
A pandas series is similar to a dict in some ways, and coercing to a dict works as expected:
In [27]: series=pd.Series({'a':2,'b':3})
In [28]: dict(series)
Out[28]: {'a': 2, 'b': 3}
But when inside a ChainMap, this goes awry:
In [25]: dict(ChainMap(series))
... which should be equivalent to the first expression, I think, but instead...
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in get_value(self, series, key)
1789 try:
-> 1790 return self._engine.get_value(s, k)
1791 except KeyError as e1:
pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3204)()
pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:2903)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()
KeyError: 2
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-25-ffe959c53a67> in <module>()
----> 1 dict(ChainMap(series))
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/collections/__init__.py in __getitem__(self, key)
865 for mapping in self.maps:
866 try:
--> 867 return mapping[key] # can't use 'key in mapping' with defaultdict
868 except KeyError:
869 pass
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
555 def __getitem__(self, key):
556 try:
--> 557 result = self.index.get_value(self, key)
558
559 if not np.isscalar(result):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in get_value(self, series, key)
1794
1795 try:
-> 1796 return tslib.get_value_box(s, key)
1797 except IndexError:
1798 raise
pandas/tslib.pyx in pandas.tslib.get_value_box (pandas/tslib.c:16375)()
pandas/tslib.pyx in pandas.tslib.get_value_box (pandas/tslib.c:16126)()
IndexError: index out of bounds
FWIW this does work:
In [29]: dict(ChainMap(dict(series)))
Out[29]: {'a': 2, 'b': 3}
...so ChainMap seems to be calling parts of the interface of Series that dict doesn't call. I can't work out what, because I can't seem to find Python code that replicates what dict(mapping) does.
Upvotes: 3
Views: 820
Reputation: 231738
While the code for dict(mapping) is not readily available (it is compiled), the code for ChainMap is pure Python. But is this a problem with dict() or a problem with the use of ChainMap?
A simpler case where dict works but ChainMap does not is with an iterable:
In [569]: dict([['a','b'],[1,2]])
Out[569]: {1: 2, 'a': 'b'}
In [570]: collections.ChainMap([['a','b'],[1,2]])
Out[570]: ChainMap([['a', 'b'], [1, 2]])
In [571]: collections.ChainMap([['a','b'],[1,2]])['a']
....
TypeError: list indices must be integers, not str
In [572]: collections.ChainMap(dict([['a','b'],[1,2]]))['a']
Out[572]: 'b'
In this case, ChainMap doesn't produce the error until asked to do the indexing. It collects the list of inputs just fine.
Experimenting with the pandas series:
In [591]: ps = pd.Series(dict(a=1,b=2))
In [592]: dict(ps)
Out[592]: {'b': 2, 'a': 1}
So dict() on the series results in an ordinary-looking dictionary.
In [593]: collections.ChainMap(ps)
Out[593]:
ChainMap(a 1
b 2
dtype: int64)
But what is the ChainMap on this series?
In [594]: collections.ChainMap(ps)[0]
Out[594]: 1
In [595]: collections.ChainMap(ps)['a']
Out[595]: 1
Looks like it can be indexed like the series
In [596]: collections.ChainMap(ps).maps
Out[596]:
[a 1
b 2
dtype: int64]
Its maps attribute is just a one-element list containing the series itself. No transformation happens at this stage.
In [597]: collections.ChainMap(ps).maps[0]
Out[597]:
a 1
b 2
dtype: int64
In [598]: dict(collections.ChainMap(ps).maps[0])
Out[598]: {'b': 2, 'a': 1}
I can construct a dictionary from this one item, same as with dict(ps).
So the error in dict(collections.ChainMap(ps)) is produced somewhere in the depths of converting this ChainMap of a pd.Series into a regular dictionary. In other words, there's some nuance in how dict(ChainMap(...)) works.
The root of the problem is that this is a misuse of ChainMap.
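One way to avoid the misuse without copying the data is to wrap the object in a real Mapping before chaining it. The following is only a sketch: KeyedView and LabelledValues are hypothetical names invented for illustration, and LabelledValues merely imitates the one Series behaviour that matters here (keys() gives labels, iteration gives values).

```python
from collections import ChainMap
from collections.abc import Mapping


class LabelledValues:
    """Hypothetical stand-in for a Series: labelled data whose iteration yields values."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        # Unlike a dict, iteration yields the values, not the labels.
        return iter(self._data.values())


class KeyedView(Mapping):
    """Hypothetical adapter exposing any keys()/__getitem__ object as a true Mapping."""

    def __init__(self, source):
        self._source = source

    def __getitem__(self, key):
        # Raise KeyError for unknown labels so ChainMap can fall through to the next map.
        if key not in self._source.keys():
            raise KeyError(key)
        return self._source[key]

    def __iter__(self):
        # Iterate over the labels, which is what ChainMap expects from a mapping.
        return iter(self._source.keys())

    def __len__(self):
        return len(self._source.keys())


chained = ChainMap(KeyedView(LabelledValues({"a": 2, "b": 3})), {"c": 4})
print(dict(chained))
```

The simpler alternative, already shown in the question, is to copy first with ChainMap(dict(series)).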
Iterating over the series and over the dictionary produces different results:
In [614]: list(ps.__iter__())
Out[614]: [1, 2]
In [615]: list(dict(ps).__iter__())
Out[615]: ['b', 'a']
The series does have a keys method, similar to the dictionary, but not identical:
In [619]: ps.keys()
Out[619]: Index(['a', 'b'], dtype='object')
In [620]: dict(ps).keys()
Out[620]: dict_keys(['b', 'a'])
The difference in __iter__ may be crucial. Using a dictionary comprehension:
In [623]: dd=dict(ps); {k:dd[k] for k in dd}
Out[623]: {'b': 2, 'a': 1}
But apply the same to the series directly, and I get (I think) the same index out of bounds error. It comes from trying to do ps[2], because iterating over the series yields the values 1 and 2 rather than the labels.
In [624]: dd=ps; {k:dd[k] for k in dd}
...
/usr/lib/python3/dist-packages/pandas/core/series.py in __getitem__(self, key)
500 def __getitem__(self, key):
501 try:
--> 502 result = self.index.get_value(self, key)
503
504 if not np.isscalar(result):
/usr/lib/python3/dist-packages/pandas/core/index.py in get_value(self, series, key)
1404
1405 try:
-> 1406 return tslib.get_value_box(s, key)
1407 except IndexError:
1408 raise
/usr/lib/python3/dist-packages/pandas/tslib.cpython-34m-i386-linux-gnu.so in pandas.tslib.get_value_box (pandas/tslib.c:12835)()
/usr/lib/python3/dist-packages/pandas/tslib.cpython-34m-i386-linux-gnu.so in pandas.tslib.get_value_box (pandas/tslib.c:12638)()
IndexError: index out of bounds
Note the same difference in iterations when using ChainMap:
In [628]: [k for k in collections.ChainMap(ps)]
Out[628]: [1, 2]
In [629]: [k for k in collections.ChainMap(dict(ps))]
Out[629]: ['b', 'a']
or equivalently
In [651]: list(collections.ChainMap(ps).keys())
Out[651]: [1, 2]
In [652]: list(collections.ChainMap(dict(ps)).keys())
Out[652]: ['b', 'a']
It appears that dict iterates over keys(), while ChainMap uses __iter__. If the source doesn't have keys, what does dict do? That seems to be what triggers it to expect a list of tuples, or an equivalent like a two-column array:
In [656]: dict(np.arange(6).reshape(3,2))
Out[656]: {0: 1, 2: 3, 4: 5}
ChainMap on such an array can be indexed, but can't be converted to a dict:
collections.ChainMap(np.arange(6).reshape(3,2))[0]
Clearly ChainMap is a rather 'thin' wrapper around its 'maps', performing as expected when they are dictionary-like, but hit-or-miss with iterables like lists, ndarray and pd.Series.
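To make the keys() versus __iter__ distinction concrete, here is a rough pure-Python approximation of what dict(source) appears to do. It is a sketch, not the real implementation: dict_like and KeysOnly are hypothetical names, and CPython's compiled code has extra fast paths (for example when the source is already a plain dict) that are ignored here.

```python
def dict_like(source=(), **kwargs):
    """Hypothetical pure-Python approximation of dict(source, **kwargs)."""
    result = {}
    if hasattr(source, "keys"):
        # Mapping path: only keys() and __getitem__ are used, never __iter__.
        for key in source.keys():
            result[key] = source[key]
    else:
        # Iterable path: every element must unpack into a (key, value) pair.
        for key, value in source:
            result[key] = value
    result.update(kwargs)
    return result


class KeysOnly:
    """Hypothetical object offering keys() and __getitem__ but no __iter__."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]


source = KeysOnly({"a": 2, "b": 3})
print(dict_like(source), dict(source))
print(dict_like([["a", "b"], [1, 2]]), dict([["a", "b"], [1, 2]]))
```

Under this reading, a Series takes the first branch because it has a keys method, so its value-yielding __iter__ is never consulted by dict.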
Upvotes: 0
Reputation: 310297
It looks like series aren't really true mappings... Note that iterating over the series yields the values, not the keys:
>>> list(series)
[2, 3]
collections.ChainMap relies on the fact that iterating over a mapping should yield the keys.
Apparently, the dict constructor doesn't rely on this fact (IIRC, it uses the .keys method -- for which pandas returns a suitable object).
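The same behaviour can be reproduced without pandas. In this sketch, ValueIterating is a hypothetical class that copies only the relevant trait of a Series (keys() returns the labels, iteration returns the values). One difference from the question: this stand-in raises KeyError for an unknown key, whereas the Series in the traceback raised IndexError.

```python
from collections import ChainMap


class ValueIterating:
    """Hypothetical Series-like object: keys() gives labels, iteration gives values."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data.values())


series_like = ValueIterating({"a": 2, "b": 3})

# dict() goes through keys() and __getitem__, so it sees the labels.
as_dict = dict(series_like)

# ChainMap iterates over each underlying map, so its "keys" are really the values.
chained_keys = sorted(ChainMap(series_like))

# Converting the ChainMap then looks up those values as if they were keys and fails.
try:
    dict(ChainMap(series_like))
    conversion_failed = False
except KeyError:
    conversion_failed = True

print(as_dict, chained_keys, conversion_failed)
```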
Upvotes: 2