user2309803

Reputation: 645

Why does pandas change the index value in this example?

First we create a raw dataset with a MultiIndex:

In [166]: import numpy as np; import pandas as pd 

In [167]: data_raw = pd.DataFrame([ 
     ...: {'frame': 1, 'face': np.nan, 'lmark': np.nan, 'x': np.nan, 'y': np.nan}, 
     ...: {'frame': 197, 'face': 0, 'lmark': 1, 'x': 969, 'y': 737}, 
     ...: {'frame': 197, 'face': 0, 'lmark': 2, 'x': 969, 'y': 740}, 
     ...: {'frame': 197, 'face': 0, 'lmark': 3, 'x': 970, 'y': 744}, 
     ...: {'frame': 197, 'face': 0, 'lmark': 4, 'x': 972, 'y': 748}, 
     ...: {'frame': 197, 'face': 0, 'lmark': 5, 'x': 973, 'y': 752}, 
     ...: {'frame': 300, 'face': 0, 'lmark': 1, 'x': 745, 'y': 367},  
     ...: {'frame': 300, 'face': 0, 'lmark': 2, 'x': 753, 'y': 411},  
     ...: {'frame': 300, 'face': 0, 'lmark': 3, 'x': 759, 'y': 455}, 
     ...: {'frame': 301, 'face': 0, 'lmark': 1, 'x': 741, 'y': 364},   
     ...: {'frame': 301, 'face': 0, 'lmark': 2, 'x': 746, 'y': 408},   
     ...: {'frame': 301, 'face': 0, 'lmark': 3, 'x': 750, 'y': 452}]).set_index(['frame', 'face', 'lmark'])

Next we calculate the z-scores for each lmark:

In [168]: ((data_raw - data_raw.mean(level='lmark')).abs()) / data_raw.std(level='lmark')            
Out[168]: 
                         x         y
frame face lmark                    
1     NaN  NaN         NaN       NaN
197   0.0  1.0    1.154565  1.154672
           2.0    1.154260  1.154665
           3.0    1.153946  1.154654
           4.0         NaN       NaN
           5.0         NaN       NaN
300   0.0  1.0    0.561956  0.570343
           2.0    0.549523  0.569472
           3.0    0.540829  0.568384
301   0.0  1.0    0.592609  0.584329
           2.0    0.604738  0.585193
           3.0    0.613117  0.586270

The index values don't change, as expected. Now we filter out the records where lmark > 3:

In [170]: data_filtered = data_raw.loc[(slice(None), slice(None), [np.nan, slice(3)]),:]

In [171]: data_filtered                                                                          
Out[171]: 
                      x      y
frame face lmark              
1     NaN  NaN      NaN    NaN
197   0.0  1.0    969.0  737.0
           2.0    969.0  740.0
           3.0    970.0  744.0
300   0.0  1.0    745.0  367.0
           2.0    753.0  411.0
           3.0    759.0  455.0
301   0.0  1.0    741.0  364.0
           2.0    746.0  408.0
           3.0    750.0  452.0

and recalculate the z-scores:

In [172]: ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')                                                                                       
Out[172]: 
                         x         y
frame face lmark                    
1     NaN  1.0         NaN       NaN
197   0.0  1.0    1.154565  1.154672
           2.0    1.154260  1.154665
           3.0    1.153946  1.154654
300   0.0  1.0    0.561956  0.570343
           2.0    0.549523  0.569472
           3.0    0.540829  0.568384
301   0.0  1.0    0.592609  0.584329
           2.0    0.604738  0.585193
           3.0    0.613117  0.586270

Why has the value of the first record's lmark index changed from NaN to 1.0?

Upvotes: 0

Views: 52

Answers (1)

jezrael

Reputation: 862441

I think this is a bug.

A workaround is to call MultiIndex.remove_unused_levels on the filtered index before calculating:

data_filtered.index = data_filtered.index.remove_unused_levels()
a = ((data_filtered - data_filtered.mean(level='lmark')).abs()) / data_filtered.std(level='lmark')
print(a)
                         x         y
frame face lmark                    
1     NaN  NaN         NaN       NaN
197   0.0  1.0    1.154565  1.154672
           2.0    1.154260  1.154665
           3.0    1.153946  1.154654
300   0.0  1.0    0.561956  0.570343
           2.0    0.549523  0.569472
           3.0    0.540829  0.568384
301   0.0  1.0    0.592609  0.584329
           2.0    0.604738  0.585193
           3.0    0.613117  0.586270
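
To see what remove_unused_levels actually changes, here is a minimal sketch on a made-up frame (the names idx, df, filtered and cleaned are mine, not from the question). After filtering, a MultiIndex keeps every original level value, even those that no longer appear in any row. That is presumably what confuses the level-based calculation above, although I have not traced the exact cause.

```python
import pandas as pd

# Hypothetical toy data: two frames, landmarks 1..5 (not the question's dataset).
idx = pd.MultiIndex.from_product(
    [[197, 300], [1, 2, 3, 4, 5]], names=["frame", "lmark"]
)
df = pd.DataFrame({"x": range(10)}, index=idx)

# Keep only rows with lmark <= 3.
filtered = df[df.index.get_level_values("lmark") <= 3]

# The 'lmark' level still lists 4 and 5, although no row uses them any more.
levels_before = list(filtered.index.levels[1])

# remove_unused_levels returns a new MultiIndex with only the values in use.
cleaned = filtered.index.remove_unused_levels()
levels_after = list(cleaned.levels[1])

print(levels_before)
print(levels_after)
```

The rows and their labels stay the same; only the stored level values shrink, which is why assigning the cleaned index back to data_filtered.index is safe.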

Upvotes: 1
