Reputation: 13498
Running this code shows the difference in speed between pandas and a regular python list:
from random import randint
from time import time

import pandas as pd

ser = pd.Series(range(100))
lst = ser.tolist()

for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(100000):
        r = randint(0, len(ser)-1)
        t = time()
        ser[r]
        pandas_time += time() - t
        t = time()
        lst[r]
        list_time += time() - t
    print(pandas_time, list_time)
The results (10 trials of indexing random elements 100000 times):
Pandas Regular List
0.6404812335968018 0.03125190734863281
0.6560468673706055 0.0
0.5779874324798584 0.01562190055847168
0.5467743873596191 0.015621662139892578
0.6106545925140381 0.004016399383544922
0.5866603851318359 0.029597759246826172
0.7981059551239014 0.016004562377929688
0.8128316402435303 0.013040542602539062
0.5566465854644775 0.021578073501586914
0.6386256217956543 0.00500178337097168
Indexing a pandas Series seems to be 30-100 times slower than a Python list. Why? How can we speed this up?
Upvotes: 1
Views: 8512
Reputation: 945
pandas' implementation of index and reindex carries a lot of overhead compared to a plain list. See the following issue for further discussion: https://github.com/pandas-dev/pandas/issues/23735
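You can see that overhead directly with a quick microbenchmark (a sketch using timeit; the absolute numbers depend on your machine and pandas version, only the relative order matters):

```python
from timeit import timeit

import pandas as pd

ser = pd.Series(range(100))
arr = ser.values      # the underlying numpy array
lst = ser.tolist()

n = 100_000
# Index the same element repeatedly so we measure only lookup cost.
print("Series: ", timeit(lambda: ser[42], number=n))
print("ndarray:", timeit(lambda: arr[42], number=n))
print("list:   ", timeit(lambda: lst[42], number=n))
```

On a typical machine the Series lookup is slowest by a wide margin, with the ndarray and list close together.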
Upvotes: 1
Reputation: 13498
I checked pandas' source code. The __getitem__ implementation of a pandas Series contains a lot of additional business logic compared to a regular Python list, because a Series supports indexing with lists and iterables.
When you index a pandas Series, the Series:
1) Tries to apply the key if it is callable
2) Gets the value of the index at that key (sounds simple enough, but keep in mind the index is another pandas object that also has to support more than regular indexing)
3) Checks whether the result of 2) is a scalar
4) If it is a scalar, returns the result
These additional steps slow down __getitem__ dramatically compared to a regular Python list.
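Each of those branches exists because a Series accepts far more key types than a list does. A short sketch of what __getitem__ has to accommodate:

```python
import pandas as pd

ser = pd.Series(range(100))

# A plain integer key returns a scalar, like a list would.
print(ser[3])                         # 3

# But Series.__getitem__ must also handle list-like keys...
print(ser[[1, 2, 3]].tolist())        # [1, 2, 3]

# ...and callable keys, which are applied to the Series first
# (here the lambda produces a boolean mask).
print(ser[lambda s: s < 3].tolist())  # [0, 1, 2]
```

Every scalar lookup pays for this dispatching even though it never takes the list or callable branches.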
To work around this you can work directly with the underlying numpy array. Here we use ser.values to index instead:
from random import randint
from time import time

import pandas as pd

ser = pd.Series(range(100))
lst = ser.tolist()
ser = ser.values  # replace the Series with its underlying numpy array

for _ in range(10):
    pandas_time = 0
    list_time = 0
    for _ in range(1000000):
        r = randint(0, len(ser)-1)
        t = time()
        ser[r]
        pandas_time += time() - t
        t = time()
        lst[r]
        list_time += time() - t
    print(pandas_time, list_time)
After indexing 1000000 random elements 10 times, we find that using .values is much faster than indexing the pandas Series directly, but still somewhat slower than a Python list:
pd.Series.values Regular List
0.18845057487487793 0.04786252975463867
0.10950899124145508 0.11034011840820312
0.048889875411987305 0.09512066841125488
0.17272686958312988 0.1406867504119873
0.14252233505249023 0.048066139221191406
0.06352949142456055 0.07906699180603027
0.1405477523803711 0.07815265655517578
0.18746685981750488 0.08007645606994629
0.1405184268951416 0.0781564712524414
0.07921838760375977 0.1412496566772461
To summarize, using .values is the way to go when you need to quickly index a pandas Series. While it may look like .tolist() is even faster, keep in mind that it is only slightly faster when indexing individual elements; numpy arrays support much faster fancy indexing, such as indexing with multiple elements at once.
Upvotes: 4