Maximilian

Reputation: 8530

How does Python's dict constructor handle Mappings?

What does dict(mapping) actually do?

Background:

Python's docs suggest there are three possible paths when constructing a dict, one of which is with a Mapping.

A pandas series is similar to a dict in some ways, and coercing to a dict works as expected:

In [27]: series=pd.Series({'a':2,'b':3})

In [28]: dict(series)
Out[28]: {'a': 2, 'b': 3}

But when inside a ChainMap, this goes awry:

In [25]: dict(ChainMap(series))

... which should be equivalent to the first expression, I think, but instead...

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in get_value(self, series, key)
   1789         try:
-> 1790             return self._engine.get_value(s, k)
   1791         except KeyError as e1:

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3204)()

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:2903)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()

KeyError: 2

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-25-ffe959c53a67> in <module>()
----> 1 dict(ChainMap(series))

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/collections/__init__.py in __getitem__(self, key)
    865         for mapping in self.maps:
    866             try:
--> 867                 return mapping[key]             # can't use 'key in mapping' with defaultdict
    868             except KeyError:
    869                 pass

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/series.py in __getitem__(self, key)
    555     def __getitem__(self, key):
    556         try:
--> 557             result = self.index.get_value(self, key)
    558 
    559             if not np.isscalar(result):

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/index.py in get_value(self, series, key)
   1794 
   1795             try:
-> 1796                 return tslib.get_value_box(s, key)
   1797             except IndexError:
   1798                 raise

pandas/tslib.pyx in pandas.tslib.get_value_box (pandas/tslib.c:16375)()

pandas/tslib.pyx in pandas.tslib.get_value_box (pandas/tslib.c:16126)()

IndexError: index out of bounds

FWIW this does work:

In [29]: dict(ChainMap(dict(series)))
Out[29]: {'a': 2, 'b': 3}

...so ChainMap seems to be calling parts of the Series interface that dict doesn't call. I can't work out which parts, because I can't find Python code that replicates what dict(mapping) does.

Upvotes: 3

Views: 820

Answers (2)

hpaulj

Reputation: 231738

The code behind dict(mapping) can't easily be read from Python (it is compiled), but the code for ChainMap is pure Python. So is this a problem with dict(), or a problem with the use of ChainMap?


A simpler case where dict works but ChainMap does not is a plain iterable of pairs:

In [569]: dict([['a','b'],[1,2]])
Out[569]: {1: 2, 'a': 'b'}
In [570]: collections.ChainMap([['a','b'],[1,2]])
Out[570]: ChainMap([['a', 'b'], [1, 2]])
In [571]: collections.ChainMap([['a','b'],[1,2]])['a']
 ....
TypeError: list indices must be integers, not str
In [572]: collections.ChainMap(dict([['a','b'],[1,2]]))['a']
Out[572]: 'b'

In this case, ChainMap doesn't produce the error until asked to do the indexing. It collects the list of inputs just fine.


Experimenting with the pandas series:

In [591]: ps = pd.Series(dict(a=1,b=2))
In [592]: dict(ps)
Out[592]: {'b': 2, 'a': 1}

So dict() on series results in an ordinary looking dictionary

In [593]: collections.ChainMap(ps)
Out[593]: 
ChainMap(a    1
b    2
dtype: int64)

But what is the ChainMap on this series?

In [594]: collections.ChainMap(ps)[0]
Out[594]: 1
In [595]: collections.ChainMap(ps)['a']
Out[595]: 1

Looks like it can be indexed like the series

In [596]: collections.ChainMap(ps).maps
Out[596]: 
[a    1
 b    2
 dtype: int64]

Its maps attribute is just a one element list, containing the series itself. No transformation at this stage

In [597]: collections.ChainMap(ps).maps[0]
Out[597]: 
a    1
b    2
dtype: int64
In [598]: dict(collections.ChainMap(ps).maps[0])
Out[598]: {'b': 2, 'a': 1}

I can construct a dictionary from this one item, same as with dict(ps).

So the error in dict(collections.ChainMap(ps)) is produced somewhere in the depths of converting this ChainMap of a pd.Series into a regular dictionary. In other words, there's some nuance in how dict(ChainMap(...)) works.

The root of the problem is that this is a misuse of ChainMap.


Iterating over the series and over the dictionary produces different results:

In [614]: list(ps.__iter__())
Out[614]: [1, 2]
In [615]: list(dict(ps).__iter__())
Out[615]: ['b', 'a']

The series does have a keys method, similar to the dictionary, but not identical:

In [619]: ps.keys()
Out[619]: Index(['a', 'b'], dtype='object')
In [620]: dict(ps).keys()
Out[620]: dict_keys(['b', 'a'])

The difference in __iter__ may be crucial. Using a dictionary comprehension:

In [623]: dd=dict(ps); {k:dd[k] for k in dd}
Out[623]: {'b': 2, 'a': 1}

But apply the same to the series directly, and I get (I think) the same index out of bounds error - it comes from trying to do ps[2].

In [624]: dd=ps; {k:dd[k] for k in dd}
...
/usr/lib/python3/dist-packages/pandas/core/series.py in __getitem__(self, key)
    500     def __getitem__(self, key):
    501         try:
--> 502             result = self.index.get_value(self, key)
    503 
    504             if not np.isscalar(result):

/usr/lib/python3/dist-packages/pandas/core/index.py in get_value(self, series, key)
   1404 
   1405             try:
-> 1406                 return tslib.get_value_box(s, key)
   1407             except IndexError:
   1408                 raise

/usr/lib/python3/dist-packages/pandas/tslib.cpython-34m-i386-linux-gnu.so in pandas.tslib.get_value_box (pandas/tslib.c:12835)()

/usr/lib/python3/dist-packages/pandas/tslib.cpython-34m-i386-linux-gnu.so in pandas.tslib.get_value_box (pandas/tslib.c:12638)()

IndexError: index out of bounds

Note the same difference in iterations when using ChainMap:

In [628]: [k for k in collections.ChainMap(ps)]
Out[628]: [1, 2]
In [629]: [k for k in collections.ChainMap(dict(ps))]
Out[629]: ['b', 'a']

or equivalently

In [651]: list(collections.ChainMap(ps).keys())
Out[651]: [1, 2]
In [652]: list(collections.ChainMap(dict(ps)).keys())
Out[652]: ['b', 'a']

It appears that dict iterates over keys() and looks up each key by indexing, while ChainMap relies on __iter__. If the source doesn't have a keys method, dict instead expects an iterable of key/value pairs, such as a list of tuples or an equivalent like a 2 column array:

In [656]: dict(np.arange(6).reshape(3,2))
Out[656]: {0: 1, 2: 3, 4: 5}
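
As a rough illustration of those two paths, here is a pure-Python sketch of what dict(arg) seems to do with a single positional argument. The name dict_from is made up for this example, and the real constructor is compiled code with extra fast paths, so treat this as an approximation of the documented behaviour rather than the actual implementation:

```python
def dict_from(arg):
    """Approximate dict(arg) for one positional argument (illustrative only)."""
    result = {}
    if hasattr(arg, "keys"):
        # Mapping-like path: keys come from .keys(), values from indexing.
        # Note that __iter__ is never used on arg itself here.
        for key in arg.keys():
            result[key] = arg[key]
    else:
        # Fallback path: arg is treated as an iterable of (key, value) pairs.
        for key, value in arg:
            result[key] = value
    return result


# Same result as dict([['a', 'b'], [1, 2]]) in the example above.
pairs = dict_from([['a', 'b'], [1, 2]])
```

If this reading is right, a pd.Series takes the first branch because it has a keys method, which would explain why dict(ps) never notices that iterating the series yields values.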

ChainMap on such an array can be indexed, but can't be converted to a dict:

collections.ChainMap(np.arange(6).reshape(3,2))[0]

Clearly ChainMap is a rather 'thin' wrapper around its 'maps', performing as expected when they are dictionary-like, but hit-or-miss with objects like lists, ndarray and pd.Series.

Upvotes: 0

mgilson

Reputation: 310297

It looks like series aren't really true mappings... Note that iterating over the series yields the values, not the keys:

>>> list(series)
[2, 3]

collections.ChainMap relies on the fact that iterating over a mapping should yield the keys.

Apparently, the dict constructor doesn't rely on this fact (IIRC, it uses the .keys method -- for which pandas returns a suitable object).
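
The mismatch can probably be reproduced without pandas. The sketch below uses a made-up class, LabelledValues, as a stand-in for a series: keys() returns the labels, indexing takes a label, but iteration yields the values. It assumes a Python version where ChainMap.__iter__ iterates over its underlying maps (as in the 3.5 traceback in the question); that method has been implemented differently across releases, so the outcome may vary.

```python
from collections import ChainMap


class LabelledValues:
    """Hypothetical stand-in for a series: keys() gives labels, iteration gives values."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        # Unlike a true mapping, iteration yields the values.
        return iter(self._data.values())


lv = LabelledValues({'a': 2, 'b': 3})

# dict() goes through keys() and indexing, so this works.
plain = dict(lv)

# ChainMap derives its keys by iterating the wrapped object, so it sees values.
chained_keys = sorted(ChainMap(lv))

# dict(ChainMap(lv)) then tries to look up those values as keys and fails.
try:
    dict(ChainMap(lv))
    failed = False
except KeyError:
    failed = True
```

With a real series the failed lookup surfaces as an IndexError rather than a KeyError, presumably because pandas falls back to positional indexing for integer keys.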

Upvotes: 2
