Thriveth

Reputation: 405

Pandas: df.set_value() method erases / resets column names of MultiIndex

I am writing an application that makes use of pandas (version 0.10.1) to store the underlying data model as a (3-level) MultiIndex'ed DataFrame. The model is a line spectrum, and the top level of the index is the atomic transition.

A simple dataframe could look like this:

                               Pos     Sigma       Ampl  Line center Identifier
H-alpha-6697.6 30-30 Comp2  -3.600  0.774000  33.058000       6699.5          b
                     Comp3   3.538  2.153000  28.054000       6699.5          c
                     Contin    NaN       NaN   0.000000          NaN        NaN
                     Comp4   1.384  0.921000  37.504000       6699.5          d
                     Comp1  -2.124  1.977000  69.166000       6699.5          a
               31-31 Comp2  -3.292  0.884603  49.813423       6699.5          b
                     Comp3   3.600  2.299000  19.999000       6699.5          c
                     Contin    NaN       NaN   0.000000          NaN        NaN
                     Comp4   1.692  1.009000  22.222000       6699.5          d
                     Comp1  -1.262  2.534000  68.002000       6699.5          a

At some point, I need to be able to create a different transition, e.g. H-beta, using H-alpha as a template. I would ideally do this by something like df.ix['H-beta-wavelength'] = df.ix['H-alpha-6697.6'], but this is not possible to do. So instead, I tried following this example: Prepend a level to a pandas MultiIndex

However, the example above requires the .names of the MultiIndex levels to be set in order to reorder them. The names attribute is set when the dataframe is initialized, but while building it I rely quite extensively on the set_value() method, and doing this destroys the names attribute, or rather sets the names to [None, None, None].

Example:

In [68]: df
Out[68]: 
                                  Pos  Sigma     Ampl  Line center Identifier
Transition     Rows  Component                                               
Center: 6699.5 26-26 Comp2     -3.846  0.657  15.2740       6699.5          b
                     Comp3      2.924  1.449  31.3930       6699.5          c
                     Contin       NaN    NaN   0.0000          NaN        NaN
                     Comp4      8.030  1.009   7.0831       6699.5          d
                     Comp1     -1.816  2.153  50.2750       6699.5          a

In [69]: df.set_value(('Center: 5044.3', '26-26', 'Comp1'), 'Sigma', 2.457)
Out[69]: 
                               Pos  Sigma     Ampl  Line center Identifier
Center: 6699.5 26-26 Comp2  -3.846  0.657  15.2740       6699.5          b
                     Comp3   2.924  1.449  31.3930       6699.5          c
                     Contin    NaN    NaN   0.0000          NaN        NaN
                     Comp4   8.030  1.009   7.0831       6699.5          d
                     Comp1  -1.816  2.153  50.2750       6699.5          a
Center: 5044.3 26-26 Comp1     NaN  2.457      NaN          NaN        NaN

Of course, this makes it quite hard to use the names for reordering the levels of the MultiIndex. Is there a way to avoid this, short of brute-force resetting the names every time I have run set_value()?
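For reference, the brute-force approach mentioned above could be wrapped in a small helper. This is only a sketch: with_names_restored is a hypothetical name, the toy frame is made up, and the loss of names is simulated by hand rather than produced by set_value(), since that method is no longer available in recent pandas releases.

```python
import pandas as pd


def with_names_restored(frame, names):
    """Return a copy of `frame` whose index levels carry `names` again."""
    out = frame.copy()
    # set_names returns a new index, so the input frame is left untouched
    out.index = out.index.set_names(names)
    return out


idx = pd.MultiIndex.from_tuples(
    [("Center: 6699.5", "26-26", "Comp1"), ("Center: 6699.5", "26-26", "Comp2")],
    names=["Transition", "Rows", "Component"],
)
df = pd.DataFrame({"Sigma": [2.153, 0.657]}, index=idx)

# Remember the names before any operation that might drop them
names = list(df.index.names)

# Simulated result of such an operation: same data, level names lost
stripped = df.copy()
stripped.index = stripped.index.set_names([None, None, None])

restored = with_names_restored(stripped, names)
```

The helper only puts the labels back; it does not re-sort the index.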

EDIT: simpler, reproducible example.

Here is an IPython session that reproduces the index.names problem with a somewhat simpler example. It also suggests that the bug may go beyond index.names, since index.lexsort_depth changes from 3 to 0. The prompt numbers that are missing were just unnecessary views of the dataframe. I believe that, in order to reproduce it, one must choose secondary and/or tertiary index values that already exist, as I have done below.

In [4]: idx = pd.MultiIndex.from_arrays(
            [['Hans']*4 + ['Grethe']*4, ['1', '1', '2', '2']*2, ['a', 'b']*4], 
            names=['Name', 'Number', 'Letter'])

In [5]: df = pd.DataFrame(
            np.random.random((8, 3)), 
            columns=['one', 'two','three'], 
            index=idx)


In [6]: df
Out[6]: 
                           one       two     three
Name   Number Letter                              
Hans   1      a       0.803566  0.434574  0.805976
              b       0.655322  0.208469  0.989559
       2      a       0.893952  0.380358  0.173764
              b       0.822446  0.673894  0.676573
Grethe 1      a       0.202641  0.387263  0.405296
              b       0.646733  0.086953  0.882114
       2      a       0.358458  0.147107  0.769586
              b       0.183782  0.477863  0.601098

# To rule out another possible source of problems:
In [9]: df.unstack().drop(('Grethe', '1')).stack()
Out[9]: 
                           one       two     three
Name   Number Letter                              
Grethe 2      a       0.358458  0.147107  0.769586
              b       0.183782  0.477863  0.601098
Hans   1      a       0.803566  0.434574  0.805976
              b       0.655322  0.208469  0.989559
       2      a       0.893952  0.380358  0.173764
              b       0.822446  0.673894  0.676573

In [10]: df.set_value(('Frans', '2', 'b'), 'one', 23.)
Out[10]: 
                  one       two     three
Hans   1 a   0.803566  0.434574  0.805976
         b   0.655322  0.208469  0.989559
       2 a   0.893952  0.380358  0.173764
         b   0.822446  0.673894  0.676573
Grethe 1 a   0.202641  0.387263  0.405296
         b   0.646733  0.086953  0.882114
       2 a   0.358458  0.147107  0.769586
         b   0.183782  0.477863  0.601098
Frans  2 b  23.000000       NaN       NaN

In [11]: df = df.sortlevel(level='Name')

In [13]: df.index.lexsort_depth
Out[13]: 3

In [14]: df.set_value(('Frans', '2', 'b'), 'one', 23.).index.lexsort_depth
Out[14]: 0

Upvotes: 0

Answers (2)

Thriveth

Reputation: 405

So according to Andy Hayden, this is a names bug in pandas. Hopefully a fix will come soon.

Until then, I believe the best way to do this is to do the following:

tmp = df.ix['ExistingTransition'].copy()
tmp['Transition'] = 'NewTransition'
tmp = tmp.set_index('Transition', append=True)
tmp.index = tmp.index.reorder_levels([2, 0, 1])
# ...Do whatever else needs to be done to this before applying as template...
df = df.append(tmp)

...That, or making sure that the names attribute is recreated after each run of set_value(), and then just following the example linked in the question.
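For readers on a recent pandas, where .ix and DataFrame.append no longer exist, roughly the same template idea might be sketched with xs() and pd.concat(). This is a substitution for the calls above, not the same code, and the frame, level names and values are made up for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["H-alpha"], ["30-30", "31-31"], ["Comp1", "Comp2"]],
    names=["Transition", "Rows", "Component"],
)
df = pd.DataFrame({"Sigma": [1.977, 0.774, 2.534, 0.885]}, index=idx)

# Select the existing transition as a template; xs drops the selected
# level, leaving a frame indexed by (Rows, Component)
template = df.xs("H-alpha", level="Transition").copy()

# ...adjust the template here before applying it...

# Passing a dict to concat prepends a new outer level; `names` labels it
new_block = pd.concat({"H-beta": template}, names=["Transition"])

# Append the new transition and keep the index sorted
combined = pd.concat([df, new_block]).sort_index()
```

Because the new level is created with its name already attached, no reorder_levels() step is needed in this variant.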

Upvotes: 0

Jeff

Reputation: 129008

Your index needs to be sorted! See the docs on the need for sortedness (http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-need-for-sortedness); these recipes may also help: http://pandas.pydata.org/pandas-docs/dev/cookbook.html

This is on 0.10.1 as well.

Here's a sorted frame:

In [26]: index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
              names=['first', 'second'])

In [27]: df = pd.DataFrame(np.random.rand(len(index)), index=index,columns=['A'])

In [7]: df.index.lexsort_depth
Out[7]: 2

In [28]: df.set_value(('a',1),'A',1)
Out[28]: 
                     A
first second          
a     1       1.000000
      2       0.136456
b     1       0.712612
      2       0.818473

And if I sort by the 2nd level (so it's no longer lexsorted):

In [29]: df2 = df.sortlevel(level='second')

# this is not sorted! (well it is, just not lexsorted)
In [10]: df2.index.lexsort_depth
Out[10]: 0

In [30]: df2.set_value(('b','1'),'A',2)
Out[30]: 
            A
a 1  1.000000
b 1  0.712612
a 2  0.136456
b 2  0.818473
  1  2.000000
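To illustrate the same point in a form that also runs on recent pandas, where set_value() and sortlevel() are gone, here is a minimal sketch that uses sort_index() and .loc instead. Those are substitutions, not the calls used above, and is_monotonic_increasing is used only as a rough stand-in for checking that the index is lexsorted.

```python
import pandas as pd

# Same labels as df2 above, deliberately ordered by the second level
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 1), ("a", 2), ("b", 2)], names=["first", "second"]
)
df2 = pd.DataFrame({"A": [1.0, 0.712612, 0.136456, 0.818473]}, index=index)

# Not in lexicographic order: ("b", 1) comes before ("a", 2)
unsorted_ok = df2.index.is_monotonic_increasing

# Sort on all levels before doing label-based assignment
df_sorted = df2.sort_index()
sorted_ok = df_sorted.index.is_monotonic_increasing

# Assign to an existing key on the sorted frame
df_sorted.loc[("a", 1), "A"] = 5.0
```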

Upvotes: 1
