user2727167
user2727167

Reputation: 440

pandas multiindex add new column performance issue

I have a dataframe like this:

                           Band1       lat1       lon1
latitude level longitude                              
41.0     1000  19.50        23.0  41.015335  19.548331
               19.50        44.0  41.015335  19.565497
               19.50        12.0  41.015335  19.582663
               19.75        35.0  41.015335  19.668494
               19.75        83.0  41.015335  19.685660

I want to add following column to multiIndex ( this is DatetimeIndex Type ):

DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 01:00:00',
               '1979-01-01 02:00:00', '1979-01-01 03:00:00',
               '1979-01-01 04:00:00', '1979-01-01 05:00:00',
               '1979-01-01 06:00:00', '1979-01-01 07:00:00',
               '1979-01-01 08:00:00', '1979-01-01 09:00:00',
               ...
               '2019-12-30 15:00:00', '2019-12-30 16:00:00',
               '2019-12-30 17:00:00', '2019-12-30 18:00:00',
               '2019-12-30 19:00:00', '2019-12-30 20:00:00',
               '2019-12-30 21:00:00', '2019-12-30 22:00:00',
               '2019-12-30 23:00:00', '2019-12-31 00:00:00'],
              dtype='datetime64[ns]', length=179305, freq=None)

I tried procedure described here but it takes hours of looping without result ( probably due to large number of rows - 179305 in this case. Desired results would be:

                                                    Band1       lat1       lon1
latitude level longitude time                 
41.0     1000  19.50    '1979-01-01 00:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 01:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 02:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 03:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 04:00:00'       23.0  41.015335  19.548331
                        ...                         ...     ...         ...

               19.60    '1979-01-01 00:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 01:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 02:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 03:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 04:00:00'       44.0  41.015335  19.565497
                        ...                         ....
               19.65                                12.0  41.015335  19.582663
               19.75                                35.0  41.015335  19.668494
               19.75                                83.0  41.015335  19.685660
                                                    ...        ...        ...
46.5     850   23.00                                1280.0  46.491333  23.015891
               23.00                                1390.0  46.491333  23.033057
               23.00                                1508.0  46.491333  23.050223
               23.00                                1519.0  46.491333  23.067389
               23.00                                1544.0  46.491333  23.084556

Main issue is speed, hence for looping is not an option. Any help is appreciated.

Upvotes: 0

Views: 60

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150735

You want append option in set_index:

# toy data
idx = pd.MultiIndex.from_arrays([list('aabbcc'), list('111111')], names=['x','y'])
df = pd.DataFrame(np.arange(18).reshape(-1,3), 
                  index=idx,
                  columns=list('abc'))

times = [11,22]

# calculate multiplicity of the last index
multi = len(df.index)//len(times)


df = (df.assign(time=np.tile(times, multi))         # replace [0,1,2,3,4] with your datetime series
        .set_index('time', append=True)
     )

Output:

           a   b   c
x y time            
a 1 11     0   1   2
    22     3   4   5
b 1 11     6   7   8
    22     9  10  11
c 1 11    12  13  14
    22    15  16  17

Upvotes: 2

Related Questions