Reputation: 692
Anyone care to take a stab at speeding up this dataframe index slicing scheme? I'm trying to slice and dice some huge dataframes, so every bit counts. I need to somehow find a faster way of index slicing the dataframe, other than the following technique:
v = initFrame.xs(x, level=('ifoo2', 'ifoo3'), drop_level=False)
Also the loop in pd.unique is impacting performance pretty significantly.
uniqueList = list(pd.unique(initFrame[['bar1','bar4']].values))
Copy and paste the snippet below to skip the setup:
import pandas as pd
foo1 = ['LABEL1', 'LABEL1', 'LABEL2', 'LABEL2']
foo2 = [5, 5, 6, 6]
foo3 = [1, 1, 2, 3]
index = pd.MultiIndex.from_arrays([foo1,foo2,foo3], names=['ifoo1','ifoo2','ifoo3'])
initFrame = pd.DataFrame({'bar1': [5, 6, 5, 6],
                          'bar2': ['a', 'b', 'c', 'd'],
                          'bar3': [11, 22, 33, 44],
                          'bar4': [1, 2, 1, 3]}, index=index)
finDict = {}
# start timer1
uniqueList = list(pd.unique(initFrame[['bar1', 'bar4']].values))
# end timer1
for x in uniqueList:
    # start timer2
    v = initFrame.xs(x, level=('ifoo2', 'ifoo3'), drop_level=False)
    # stop timer2
    k = int(x[0]), int(x[1])
    finDict.update({k: v})
For those interested, I ended up using the following:
finDict = {}
grouper = initFrame.groupby(level=('ifoo2', 'ifoo3'))
for name, group in grouper:
    finDict.update({name: group})
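If it helps, the whole loop collapses to a dict comprehension, since iterating a GroupBy yields (name, group) pairs; a minimal equivalent sketch:
# Same mapping as the loop above: one sub-frame per (ifoo2, ifoo3) key.
finDict = {name: group
           for name, group in initFrame.groupby(level=['ifoo2', 'ifoo3'])}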
Upvotes: 1
Views: 1793
Reputation: 20553
Not an answer as such, just visualising a thought from my comment: since the MultiIndex keeps duplicate keys adjacent, we could simply compare each row's ('bar1', 'bar4') value against the previous one, skip the iteration when they match, and only then perform the dict update.
It may not be faster, but if your dataset is huge it could save you a memory-consumption problem. Pseudocode:
# ...replace timer1...
prev, finDict = None, {}
for idx, row in initFrame[['bar1', 'bar4']].iterrows():
    current = (idx[1], idx[2])    # the (ifoo2, ifoo3) index levels
    if current == prev:
        continue
    prev = current
    # ...whatever faster way to solve your 2nd timer...
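For what it's worth, a runnable version of that sketch against the question's initFrame could look like the following; the xs call is only a stand-in for whatever ends up filling the second timer, and it leans on the sample data's (bar1, bar4) values mirroring the (ifoo2, ifoo3) index levels:
# Skip-adjacent-duplicates loop: equal MultiIndex keys sit next to
# each other here, so comparing against the previous key suffices.
prev, finDict = None, {}
for idx, row in initFrame[['bar1', 'bar4']].iterrows():
    current = (idx[1], idx[2])          # the (ifoo2, ifoo3) index levels
    if current == prev:
        continue                        # same key as the last row: skip
    prev = current
    finDict[current] = initFrame.xs(current, level=('ifoo2', 'ifoo3'),
                                    drop_level=False)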
Personally, I think @Alexander's answer handles your 2nd timer rather nicely.
Upvotes: 0
Reputation: 97291
I don't know what you really want to do, but here are some hints to speed up your code:
change
uniqueList = list(pd.unique(initFrame[['bar1','bar4']].values))
to
uniqueList = initFrame[["bar1", "bar4"]].drop_duplicates().values.tolist()
and the for loop to:
g = initFrame.groupby(level=[1, 2])
uniqueSet = set(map(tuple, uniqueList))  # rows come back as lists; tuples are hashable
dict((key, df) for key, df in g if key in uniqueSet)
or:
g = initFrame.groupby(level=[1, 2])
dict((tuple(key), g.get_group(tuple(key))) for key in uniqueList)
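A quick end-to-end check on the sample frame (note this relies on the (bar1, bar4) values coinciding with the (ifoo2, ifoo3) index levels, as they do in the question's data):
# Build the dict with tuple keys, then pull one group back out.
uniqueList = initFrame[['bar1', 'bar4']].drop_duplicates().values.tolist()
g = initFrame.groupby(level=[1, 2])
finDict = {tuple(key): g.get_group(tuple(key)) for key in uniqueList}
print(finDict[(5, 1)])                  # the two LABEL1 rows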
Here is a %timeit comparison:
import numpy as np
import pandas as pd
arr = np.random.randint(0, 10, (10000, 2))
df = pd.DataFrame(arr, columns=("A", "B"))
%timeit df.drop_duplicates().values.tolist()
%timeit list(pd.unique(arr))
outputs:
100 loops, best of 3: 3.51 ms per loop
10 loops, best of 3: 94.7 ms per loop
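If row order doesn't matter, np.unique with axis=0 (available since NumPy 1.13) is another vectorized way to get unique rows; a minimal sketch:
# Unique rows of a 2-D array without a Python-level loop.
# Unlike drop_duplicates, np.unique returns the rows sorted rather
# than in first-occurrence order.
uniqueRows = np.unique(arr, axis=0).tolist()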
Upvotes: 1
Reputation: 109546
You can use a dictionary comprehension together with loc to do the dataframe indexing:
finDict = {pair: initFrame.loc[pd.IndexSlice[:, pair[0], pair[1]], :]
           for pair in pd.unique(initFrame[['bar1', 'bar4']].values).tolist()}
>>> finDict
{(5, 1): bar1 bar2 bar3 bar4
ifoo1 ifoo2 ifoo3
LABEL1 5 1 5 a 11 1
1 6 b 22 2,
(6, 2): bar1 bar2 bar3 bar4
ifoo1 ifoo2 ifoo3
LABEL2 6 2 5 c 33 1,
(6, 3): bar1 bar2 bar3 bar4
ifoo1 ifoo2 ifoo3
LABEL2 6 3 6 d 44 3}
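One caveat: label-based slicing on a MultiIndex that isn't lexsorted can trigger a PerformanceWarning and slow lookups on large frames, so it may pay to sort the index once before building the dict; a small sketch plus a lookup:
# Sort once up front so the .loc/IndexSlice lookups stay fast.
initFrame = initFrame.sort_index()
finDict[(5, 1)]                         # the two LABEL1 rows shown above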
Upvotes: 1