Andrey
Andrey

Reputation: 60065

Pandas MultiIndex single level look up is much slower than alternative access patterns

I have this isolated code snippet that should be self-explanatory:

import string
import itertools
import numpy as np
import timeit

index = list(itertools.product(range(100_000), string.ascii_uppercase))

df = pd.DataFrame(index, columns=['i', 'p'])
df['n'] = np.random.randn(len(df))
df_2 = df.set_index('i', drop=False)
df = df.set_index(['i', 'p'], drop=False)

print('One level multiindex loc', timeit.timeit(lambda: df.loc[1000], number=100))
print('Search by column',         timeit.timeit(lambda: df[df.i == 1000], number=100))
print('Non unique index loc',     timeit.timeit(lambda: df_2.loc[1000], number=100))

Results:

One level multiindex loc 0.8600521469925297
Search by column 0.23243567100143991
Non unique index loc 0.03276521500083618

I need to get group of rows by value of i (1000 in this example) and I am looking into different access patterns. What I absolutely don't get is why looking up by the first level of a MultiIndex is so slow? I understood the concept of the MultiIndex by it being fast if you are following the hierarchy.

Edit results vary significantly based on pandas version and now more looks like there is something broken with MultiIndex in pandas 1.1.0.

Here are results for 0.25:

print('One level multiindex loc 1', timeit.timeit(lambda: df.loc[1000], number=10000))
print('One level multiindex loc 2', timeit.timeit(lambda: df.loc[(1000, ), :], number=10000))
print('Search by column',           timeit.timeit(lambda: df[df.i == 1000], number=10000))
print('Non unique index loc',       timeit.timeit(lambda: df_2.loc[1000], number=10000))

One level multiindex loc 1 3.5869441789999996
One level multiindex loc 2 4.696559950999983
Search by column 26.05316364800001
Non unique index loc 2.409704655000013

To compare with 10000 repetitions with pandas==1.1.0:

One level multiindex loc 1 74.58197712
One level multiindex loc 2 74.65480156499996
Search by column 26.241522830999997
Non unique index loc 0.5789623329999927

pandas==1.0.5:

One level multiindex loc 1 75.16352942799999
One level multiindex loc 2 81.75229192099998
Search by column 25.121312993000004
Non unique index loc 2.481764503999983

Upvotes: 3

Views: 363

Answers (1)

JE_Muc
JE_Muc

Reputation: 5774

I need to update my answer, since some additional timings show completely different results:

import string
import itertools
import numpy as np
import timeit

index = list(itertools.product(range(100_000), string.ascii_uppercase))

df = pd.DataFrame(index, columns=['i', 'p'])
df['n'] = np.random.randn(len(df))
df_2 = df.set_index('i', drop=False)
df = df.set_index(['i', 'p'], drop=False)
df3 = df.copy().sort_index(level=0)

print('One level multiindex loc with tuple', timeit.timeit(lambda: df.loc[(1000, )], number=100))
print('One level multiindex loc', timeit.timeit(lambda: df.loc[1000], number=100))
print('Explicitly sorted one level multiindex loc', timeit.timeit(lambda: df3.loc[(1000, )], number=100))
print('Explicitly sorted one level multiindex loc with tuple', timeit.timeit(lambda: df3.loc[1000], number=100))
print('Search by column',         timeit.timeit(lambda: df[df.i == 1000], number=100))
print('Non unique index loc',     timeit.timeit(lambda: df_2.loc[1000], number=100))

# One level multiindex loc with tuple 0.05624850000003789
# One level multiindex loc 0.029734599999983402
# Explicitly sorted one level multiindex loc 0.03403290000005654
# Explicitly sorted one level multiindex loc with tuple 0.028620700000146826
# Search by column 0.5066366999999445
# Non unique index loc 0.0468722999999045

My pandas pd.__version__ == 1.0.5.

It seems like there is some performance regression from version 1.0.5 to 1.1.0.

Sorting the index explicitly seems to increase the the indexing speed, so that MultiIndex loc is the fastest method. I don't know what's going on internally, but even the not explicitly sorted df.index shows df.index.get_level_values(0).is_monotonic_increasing == True, so it should be handled like a sorted index. Maybe someone of the pandas core dev team can shed some light?

Upvotes: 1

Related Questions