crashMOGWAI
crashMOGWAI

Reputation: 692

Pandas: Need a speedier way of index slicing

Anyone care to take a stab at speeding up this dataframe index slicing scheme? I'm trying to slice and dice some huge dataframes, so every bit counts. I need to somehow find a faster way of index slicing the dataframe, other than the following technique:

v = initFrame.xs(x,level=('ifoo2','ifoo3'), drop_level=False) 

Also the loop in pd.unique is impacting performance pretty significantly.

uniqueList = list(pd.unique(initFrame[['bar1','bar4']].values))

Copy and paste the below snippet to avoid setup.

import pandas as pd

foo1 = (['LABEL1','LABEL1','LABEL2','LABEL2'])
foo2 = ([5,5,6,6])
foo3 = ([1,1,2,3])

index = pd.MultiIndex.from_arrays([foo1,foo2,foo3], names=['ifoo1','ifoo2','ifoo3'])

initFrame = pd.DataFrame({'bar1': [ 5,6,5,6],
                          'bar2': ['a','b','c','d'],
                          'bar3': [11,22,33,44],
                          'bar4': [1,2,1,3]}, index=index)
                           
finDict = {}
#start timer1
uniqueList = list(pd.unique(initFrame[['bar1','bar4']].values))
#end timer1
for x in uniqueList:
    #start timer2
    v = initFrame.xs(x,level=('ifoo2','ifoo3'), drop_level=False)
    #stop timer2
    k = int(x[0]), int(x[1])  
    finDict.update({k:v})

UPDATE 2016-04-04

For those interested, I ended up using the following:

finDict = {}
grouper = initFrame.groupby(level=('ifoo2', 'ifoo3'))
for name, group in grouper:
    finDict.update({name:group})

Upvotes: 1

Views: 1793

Answers (3)

Anzel
Anzel

Reputation: 20553

Not as a answer but just to visualise a thought re my comment, since multi-indexes are grouped, we can simply & possibly just compare and skip the loop if value of ('bar1', 'bar4') equals to the previous value, then perform the dict update.

It may not be speedier, but if your dataset is huge, it could potentially save you a memory consumption problem, pseudo code:

# ...replace timer1...
prev, finDict = None, {}
for n in initFrame[['bar1', 'bar4']].iterrows():
    current = (n[0][1], n[0][2])
    if current == prev: continue
    prev = current
    #... whatever faster way to solve your 2nd timer...

Personally I think @Alexander answers your 2nd timer rather nicely.

Upvotes: 0

HYRY
HYRY

Reputation: 97291

I don't know what you really want to do, but here is some hint to speedup your code:

change

uniqueList = list(pd.unique(initFrame[['bar1','bar4']].values))

to

uniqueList = initFrame[["bar1", "bar4"]].drop_duplicates().values.tolist()

and the for loop to :

g = initFrame.groupby(level=(1, 2))
uniqueSet = set(uniqueList)
dict((key, df) for key, df in g if key in uniqueSet)

or:

g = initFrame.groupby(level=(1, 2))
dict((key, g.get_group(key)) for key in uniqueList)

Here is the %timeit compare:

import numpy as np
import pandas as pd
arr = np.random.randint(0, 10, (10000, 2))
df = pd.DataFrame(arr, columns=("A", "B"))

%timeit df.drop_duplicates().values.tolist()
%timeit list(pd.unique(arr))

outputs:

100 loops, best of 3: 3.51 ms per loop
10 loops, best of 3: 94.7 ms per loop

Upvotes: 1

Alexander
Alexander

Reputation: 109546

You can use a dictionary comprehension together with loc to do the dataframe indexing:

finDict = {pair: df.loc[pd.IndexSlice[:, pair[0], pair[1]], :] 
           for pair in pd.unique(initFrame[['bar1', 'bar4']].values).tolist()}

>>> finDict
{(5, 1):                     bar1 bar2  bar3  bar4
 ifoo1  ifoo2 ifoo3                       
 LABEL1 5     1         5    a    11     1
              1         6    b    22     2,
 (6, 2):                     bar1 bar2  bar3  bar4
 ifoo1  ifoo2 ifoo3                       
 LABEL2 6     2         5    c    33     1,
 (6, 3):                     bar1 bar2  bar3  bar4
 ifoo1  ifoo2 ifoo3                       
 LABEL2 6     3         6    d    44     3}

Upvotes: 1

Related Questions