KiWiChoco
KiWiChoco

Reputation: 53

Why is there a speed difference when slicing a pandas dataframe?

I am slicing a large amount of dataframes over time.

However, the run times of the two codes below show a big difference.

code1.

for i in city_list:
     from_line = df[(df['Timestamp']==time_from) 
                               & (df['city']==i)]
     to_line = df[(df['Timestamp']==time_to) 
                               & (df['city']==i)]

     value1 = len(from_line['Value'])
     value2 = len(to_line['Value'])
     ...
     ...

code2.

for i in city_list:

     from_line_sub = df[df['Timestamp']==time_from]
     from_line = from_line_sub[from_line_sub['city']==i]

     to_line_sub = df[df['Timestamp']==time_to)]                           
     to_line = to_line_sub[to_line_sub['city']==i]

     value1 = len(from_line['Value'])
     value2 = len(to_line['Value'])
     ...
     ...

Code1 takes over an hour, while code2 takes just over 20 minutes.

I think the result of slicing is the same, but what is the reason for this speed difference?

This speed difference can be seen not only with == condition, but also with > and < conditions.

Maybe I am missing something?

Thanks for reading this far.

Upvotes: 1

Views: 238

Answers (2)

otromas
otromas

Reputation: 159

Have you tried filtering in another order?

How long does code2 take with this modification?

for i in city_list:
 
 df_t = df[df['city']==i]  # filter only one time by each city

 from_line = df_t[df_t['Timestamp']==time_from]
                       
 to_line = df_t[df_t['Timestamp']==time_to]

 value1 = len(from_line['Value'])
 value2 = len(to_line['Value'])
 ...

Upvotes: 1

Bill the Lizard
Bill the Lizard

Reputation: 405755

It looks like this is because the line

from_line = df[(df['Timestamp']==time_from) & (df['city']==i)]

is not short-circuiting on the & operator, like you'd expect. That means it's evaluating both conditions for every row in df to return the rows where both conditions are True.

In the second example, this line

from_line_sub = df[df['Timestamp']==time_from]

is returning a smaller subset of the original dataframe (presumably much smaller, given the execution time difference), then the second condition is only applied to that subset of results in the next line.


Here's an example that shows short-circuiting does not work as expected.

import pandas as pd

df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype=object),
                   'B': pd.Series([4, 5, 0], dtype=object)})

a1 = df[(df['A'] == 1) & (1 / df['B'] == 0.25)]
print(a1)

If the & operator were short-circuiting the second condition, you'd expect the 1 / df['B'] to be evaluated only for the first row (where df['A'] == 1), and only that row would be printed. This code throws a ZeroDivisionError: division by zero error though, so the second condition must be evaluated for all rows.

Upvotes: 2

Related Questions