Reputation: 1921
I have a large DataFrame with a MultiIndex, and I want to slice it using a fairly large list. Below is a sample; this operation takes almost 10 seconds on my machine.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "x": np.repeat(np.arange(10000), 50),
        "y": np.repeat(np.arange(50), 10000),
        "val": np.random.rand(50 * 10000),
    }
).set_index(["x", "y"])
large_list = range(5000, 10000)
sliced = df.loc[(large_list, slice(None)), :]  # takes almost 10 seconds on my machine
As a comparison, if I write this DataFrame to an HDF file and read it back with a where condition equivalent to my slicing operation, it takes only 1.5 seconds:
df.to_hdf("sample.hdf", key="df", append=True)
df1 = pd.read_hdf("sample.hdf", "df", where='x in large_list')
Is there a faster way to slice in memory?
Upvotes: 2
Views: 51
Reputation: 25259
If your intention is to slice a MultiIndex by an arbitrary list, using query will be much faster.
Create an arbitrary list of values from 5000 to 10000:
np.random.seed(0)
large_list = np.random.choice(list(range(5000, 10000)), 5000, replace=False)
In [2245]: large_list
Out[2245]: array([5398, 8833, 9836, ..., 6653, 7607, 7732])
x = df.query('x in @large_list')
Compare the results:
In [2246]: y = df.loc[(large_list, slice(None)),:]
In [2249]: np.allclose(x, y)
Out[2249]: True
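Another in-memory option worth benchmarking (a sketch, not part of the original answer): build a boolean mask with `Index.isin` on the first index level. This vectorizes the membership test instead of resolving each label in the list through `.loc`, and selects the same rows as `query`.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame(
    {
        "x": np.repeat(np.arange(10000), 50),
        "y": np.repeat(np.arange(50), 10000),
        "val": np.random.rand(50 * 10000),
    }
).set_index(["x", "y"])

large_list = range(5000, 10000)

# Boolean mask over the first index level; isin performs one
# vectorized membership test instead of a per-label lookup
mask = df.index.get_level_values("x").isin(large_list)
res = df[mask]
```

Note that, like `query`, this returns rows in the DataFrame's original order, whereas `.loc` with a list may follow the order of the list, so compare results after sorting if row order matters to you.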
Upvotes: 2