Primusa

Reputation: 13498

Why is pandas' indexing so slow? How to make it faster?

Running this code shows the difference in speed between pandas and a regular python list:

from random import randint
from time import time

import pandas as pd

ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
    # Accumulated time spent on indexing only, per container
    pandas_time = 0
    list_time = 0
    for _ in range(100000):
        r = randint(0, len(ser)-1)
        t = time()
        ser[r]
        pandas_time += time() - t

        t = time()
        lst[r]
        list_time += time() - t

    print(pandas_time, list_time)

The results (10 trials of indexing random elements 100000 times):

Pandas             Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168

Indexing a pandas Series seems to be roughly 30 to 100 times slower than indexing a Python list. Why? How can we speed this up?

Upvotes: 1

Views: 8512

Answers (2)

Wei Qiu

Reputation: 945

pandas' implementation of indexing and reindexing is of low quality: it carries too much overhead.

See the following link for further discussion: https://github.com/pandas-dev/pandas/issues/23735

Upvotes: 1

Primusa

Reputation: 13498

I checked pandas' source code. The __getitem__ implementation in a pandas series has a lot of additional business logic compared to the regular python list, because the pandas series supports indexing with lists and iterables.

When you index a pandas Series, the Series:

  1. Tries to apply the key if it is callable

  2. Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than regular indexing)

  3. Checks if 2) is scalar

  4. If it is scalar, returns the result

These additional steps slow down the __getitem__ dramatically compared to a regular python list.
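The steps above can be pictured with a toy class. To be clear, this is a hypothetical, heavily simplified sketch and not pandas' actual source: ToySeries and its attributes are names I made up for illustration, and the real implementation handles many more cases. It only shows how each extra check adds work in front of what a list does in one step.

```python
class ToySeries:
    """Hypothetical stand-in for a Series; NOT pandas code."""

    def __init__(self, values):
        self._values = list(values)
        # Label -> position mapping, standing in for the Index object.
        labels = range(len(self._values))
        self._index = {label: pos for pos, label in enumerate(labels)}

    def __getitem__(self, key):
        # 1. If the key is callable, call it with the series to get the real key.
        if callable(key):
            key = key(self)
        # 2. Ask the "index" to translate the label into a position.
        pos = self._index[key]
        result = self._values[pos]
        # 3. Check whether the result is a scalar.
        if isinstance(result, (int, float, str)):
            # 4. If so, return it.
            return result
        raise TypeError("non-scalar results are not handled in this sketch")


toy = ToySeries([10, 20, 30])
print(toy[1])             # prints 20
print(toy[lambda s: 2])   # callable key resolves to label 2, prints 30
```

A plain list skips all of this and goes straight from the integer to the stored element.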

To work around this, you can work directly with the underlying NumPy array. Here we index ser.values instead:

from random import randint
from time import time

import pandas as pd

ser = pd.Series(range(100))
lst = ser.tolist()

# From here on, ser is the underlying NumPy array, not a Series
ser = ser.values

for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(1000000):
        r = randint(0, len(ser)-1)
        t = time()
        ser[r]
        pandas_time += time() - t

        t = time()
        lst[r]
        list_time += time() - t

    print(pandas_time, list_time)

After 10 trials of indexing 1000000 random elements, we find that using .values is much faster than indexing the pandas Series directly but is, on average, still somewhat slower than using a Python list:

pd.Series.values    Regular List
0.18845057487487793 0.04786252975463867
0.10950899124145508 0.11034011840820312
0.048889875411987305 0.09512066841125488
0.17272686958312988 0.1406867504119873
0.14252233505249023 0.048066139221191406
0.06352949142456055 0.07906699180603027
0.1405477523803711 0.07815265655517578
0.18746685981750488 0.08007645606994629
0.1405184268951416 0.0781564712524414
0.07921838760375977 0.1412496566772461

To summarize, using .values is the way to go when you need to quickly index a pandas Series. While .tolist() looks faster, keep in mind that it is only slightly faster when indexing individual elements. NumPy arrays also support much faster fancy indexing, such as indexing with multiple elements at once.
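For anyone who wants to reproduce the comparison, here is a minimal sketch using timeit, which is less sensitive to timer resolution than summing time() differences. The helper name bench, the fixed position 42, and the repeat count are my own choices for illustration; Series.iat and Series.to_numpy() are included as alternatives worth measuring, not as something the numbers above cover. Absolute timings will depend on your machine and pandas version.

```python
import timeit

import pandas as pd

ser = pd.Series(range(100))
arr = ser.to_numpy()   # underlying NumPy array (same data as ser.values here)
lst = ser.tolist()


def bench(func, number=10_000):
    """Return the total seconds taken by `number` calls of `func`."""
    return timeit.timeit(func, number=number)


# Each lambda adds the same small call overhead, so the comparison stays fair.
timings = {
    "ser[42]": bench(lambda: ser[42]),
    "ser.iat[42]": bench(lambda: ser.iat[42]),
    "arr[42]": bench(lambda: arr[42]),
    "lst[42]": bench(lambda: lst[42]),
}
for label, seconds in timings.items():
    print(f"{label:12s} {seconds:.4f}")

# Fancy indexing: fetch several positions from the array in one operation.
positions = [3, 14, 15, 92]
picked = arr[positions]
print(picked)
```

The last lines show the case where the array is the better fit: a list would need a comprehension over the positions to do the same thing.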

Upvotes: 4
