Reputation: 94565
When creating a Pandas dataframe with a MultiIndex, the levels seem to always be sorted:
>>> pd.DataFrame([range(4)], columns=pd.MultiIndex.from_product([["b", "a"], [20, 10]]))
b a
20 10 20 10
0 0 1 2 3
>>> _.columns
MultiIndex(levels=[[u'a', u'b'], [10, 20]],
labels=[[1, 1, 0, 0], [1, 0, 1, 0]])
(Note how levels is sorted.) Is this guaranteed? Knowing this can help write robust code (since we can then rely on a simple property of MultiIndices).
I can't find any guarantee in the documentation (but then this doesn't mean that it couldn't be there!).
There are also old examples (from 2015) that show a different behavior, but perhaps Pandas now offers guarantees on the ordering of levels (in the same way that Python 3.6 offers a guarantee on the order of keys in dictionaries)?
Upvotes: 8
Views: 253
Reputation: 2724
When creating a MultiIndex using from_product() or from_arrays(), the levels will be sorted because both methods use _factorize_from_iterables(), which returns the indexes sorted.
>> list(_factorize_from_iterables([["b", "a"], [20, 10]]))
[[array([1, 0], dtype=int8), array([1, 0], dtype=int8)],
[Index(['a', 'b'], dtype='object'), Int64Index([10, 20], dtype='int64')]]
MultiIndex.from_tuples() will also have sorted levels because it uses from_arrays() internally.
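As a quick check of the three factory methods mentioned above, here is a minimal sketch that builds the same index three ways and prints the levels. It reflects observed behavior with sortable values, not a documented guarantee, and the variable names are only illustrative.

```python
import pandas as pd

# Build the same four entries through each factory method.
mi_product = pd.MultiIndex.from_product([["b", "a"], [20, 10]])
mi_arrays = pd.MultiIndex.from_arrays([["b", "b", "a", "a"], [20, 10, 20, 10]])
mi_tuples = pd.MultiIndex.from_tuples([("b", 20), ("b", 10), ("a", 20), ("a", 10)])

for mi in (mi_product, mi_arrays, mi_tuples):
    # The levels come back sorted ...
    print([level.tolist() for level in mi.levels])  # [['a', 'b'], [10, 20]]
    # ... while the entries keep the order in which they were given.
    print(list(mi))
```

Only the levels are sorted; the order of the index entries themselves is unchanged.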
If you build a MultiIndex directly with its constructor, however, the levels are kept in the order you pass them and won't be sorted.
>> midx = pd.MultiIndex(levels=[['b', 'a'], [20, 10]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>> df = pd.DataFrame(np.random.randn(4,4), columns=midx)
>> df.columns
MultiIndex(levels=[['b', 'a'], [20, 10]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
The above uses pandas version 0.22.0 (released on December 29, 2017) and was also tested on version 0.23.4 (the latest release at the time of writing).
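If code needs to cope with both cases, one option is to test whether the levels are sorted and rebuild the index when they are not. The sketch below assumes pandas 0.24 or later, where the constructor keyword is codes rather than labels; levels_are_sorted is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Assumption: pandas >= 0.24, where `labels` was renamed to `codes`.
midx = pd.MultiIndex(levels=[["b", "a"], [20, 10]],
                     codes=[[0, 0, 1, 1], [0, 1, 0, 1]])


def levels_are_sorted(index):
    """Hypothetical helper: True if every level is in increasing order."""
    return all(level.is_monotonic_increasing for level in index.levels)


print(levels_are_sorted(midx))  # False: the constructor kept ['b', 'a'] and [20, 10]

# Rebuilding from the tuples goes through the factory path, which sorts the levels.
rebuilt = pd.MultiIndex.from_tuples(list(midx), names=midx.names)
print(levels_are_sorted(rebuilt))  # True
print(list(rebuilt) == list(midx))  # True: the entries are unchanged
```

Rebuilding changes only the internal levels and codes, so lookups by key behave the same on both indexes.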
Upvotes: 3