Reputation: 53
I am slicing a large amount of dataframes over time.
However, the run times of the two codes below show a big difference.
code1.
for i in city_list:
from_line = df[(df['Timestamp']==time_from)
& (df['city']==i)]
to_line = df[(df['Timestamp']==time_to)
& (df['city']==i)]
value1 = len(from_line['Value'])
value2 = len(to_line['Value'])
...
...
code2.
for i in city_list:
from_line_sub = df[df['Timestamp']==time_from]
from_line = from_line_sub[from_line_sub['city']==i]
to_line_sub = df[df['Timestamp']==time_to)]
to_line = to_line_sub[to_line_sub['city']==i]
value1 = len(from_line['Value'])
value2 = len(to_line['Value'])
...
...
Code1 takes over an hour, while code2 takes just over 20 minutes.
I think the result of slicing is the same, but what is the reason for this speed difference?
This speed difference can be seen not only with ==
condition, but also with >
and <
conditions.
Maybe I am missing something?
Thanks for reading this far.
Upvotes: 1
Views: 238
Reputation: 159
Have you tried filtering in another order?
How long does code2 take with this modification?
for i in city_list:
df_t = df[df['city']==i] # filter only one time by each city
from_line = df_t[df_t['Timestamp']==time_from]
to_line = df_t[df_t['Timestamp']==time_to]
value1 = len(from_line['Value'])
value2 = len(to_line['Value'])
...
Upvotes: 1
Reputation: 405755
It looks like this is because the line
from_line = df[(df['Timestamp']==time_from) & (df['city']==i)]
is not short-circuiting on the &
operator, like you'd expect. That means it's evaluating both conditions for every row in df
to return the rows where both conditions are True
.
In the second example, this line
from_line_sub = df[df['Timestamp']==time_from]
is returning a smaller subset of the original dataframe (presumably much smaller, given the execution time difference), then the second condition is only applied to that subset of results in the next line.
Here's an example that shows short-circuiting does not work as expected.
import pandas as pd
df = pd.DataFrame({'A': pd.Series([1, 2, 3], dtype=object),
'B': pd.Series([4, 5, 0], dtype=object)})
a1 = df[(df['A'] == 1) & (1 / df['B'] == 0.25)]
print(a1)
If the &
operator were short-circuiting the second condition, you'd expect the 1 / df['B']
to be evaluated only for the first row (where df['A'] == 1
), and only that row would be printed. This code throws a ZeroDivisionError: division by zero
error though, so the second condition must be evaluated for all rows.
Upvotes: 2