Reputation: 13
I am extracting the maximum rainfall intensity for different durations from data with 5-minute rainfall totals. The code produces a list of the maximum rainfall intensity for each duration in DURS. It works, but it is slow on data sets with 1,000,000+ rows. I am new to pandas, and I understand that the apply() method is much faster than a for loop, but I do not know how to rewrite a for loop using apply().
Example of dataframe:
                     Value[mm]  State of value
Date_Time
2020-01-01 00:00:00        1.0               5
2020-01-01 00:05:00        0.5               5
2020-01-01 00:10:00        4.0               5
2020-01-01 00:15:00        2.0               5
2020-01-01 00:20:00        2.0               5
2020-01-01 00:25:00        0.5               5
Example of Code:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import math, numpy, array, glob
import pandas as pd
import numpy as np
pluvi_file = "rain.csv"
DURS = [5,6,10,15,20,25,30,45,60,90,120,180,270,360,540,720,1440,2880,4320]
df = pd.read_csv(pluvi_file, delimiter=',',parse_dates=[['Date','Time']])
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df.index = df['Date_Time']
del df['Date_Time']
lista = []
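for DUR in DURS:
    x = str(DUR)+' Min'
    # Sum the rainfall in fixed bins of DUR minutes
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    # Convert the largest bin total (mm) to an intensity (mm/hr)
    a = df1['Value[mm]'].max()/DUR*60
    print(a)
    lista.append(a)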
Output (Max rainfall intensity for each duration in mm/hr):
5 66.0
6 60.0
10 54.0
15 40.0
20 40.5
25 30.0
30 34.0
45 26.666666666666664
60 26.5
90 20.666666666666668
120 23.0
180 12.166666666666666
270 8.11111111111111
360 9.416666666666666
540 6.444444444444445
720 4.708333333333333
1440 3.8958333333333335
2880 2.7708333333333335
4320 2.1597222222222223
How would I rewrite this using the apply() method?
Upvotes: 1
Views: 2841
Reputation: 865
It looks like apply doesn't suit this case, since the functions you are applying to the groups are already vectorised methods (see Essential Basic Functionality in the pandas docs). Removing the for loop doesn't look like a promising route to better performance either, since there aren't many durations in your DURS list. The main cost is the grouping operation and the calculations on the groups, and there isn't much room for optimization there, at least in my opinion.
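For reference, a literal apply() rewrite of the loop might look like the sketch below. It uses only the six sample rows from the question and a shortened DURS list; the names rain and max_intensity are made up for illustration. Series.apply still calls the function once per duration, so it only moves the loop inside pandas.

```python
import pandas as pd

# Six sample rows from the question, indexed by timestamp.
rain = pd.Series(
    [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
    index=pd.date_range("2020-01-01 00:00", periods=6, freq="5min"),
    name="Value[mm]",
)

DURS = [5, 10, 15]  # shortened list, for illustration only


def max_intensity(dur):
    """Max rainfall intensity (mm/hr) over fixed bins of `dur` minutes."""
    return rain.resample(f"{dur}min").sum().max() / dur * 60


# Series.apply calls max_intensity once per duration, just like the loop.
intensities = pd.Series(DURS, index=DURS).apply(max_intensity)
print(intensities)
```

The result is a Series indexed by duration, which is arguably tidier than a bare list, but the amount of work done is the same.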
import pandas as pd
df = pd.DataFrame({'Date_Time' : ["2020-01-01 00:00:00",
"2020-01-01 00:05:00",
"2020-01-01 00:10:00",
"2020-01-01 00:15:00",
"2020-01-01 00:20:00",
"2020-01-01 00:25:00"],
'Value[mm]' : [1.0,0.5,4.0,2.0,2.0,0.5],
'State of value': [5,5,5,5,5,5]
})
df = df.sample(3900875, replace=True).reset_index(drop=True)
Now, let's set Date_Time as the index and keep just the series we need for the calculation:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df = df.set_index('Date_Time', drop = True)
df = df['Value[mm]']
%%timeit
lista = []
for DUR in DURS:
    x = str(DUR)+' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1.max()/DUR*60
    lista.append(a)
19.6 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using resample instead is slightly faster, but the difference is probably just noise, since it looks like the same thing is happening under the hood.
%%timeit
def get_max_by_dur(DUR):
    return df.resample(str(DUR)+"Min").sum().max()
l_a = [get_max_by_dur(dur)/dur*60 for dur in DURS]
17.2 s ± 559 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
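As a quick sanity check on the claim that the two forms do the same thing, the sketch below compares groupby(pd.Grouper(...)) with resample on the six sample rows. The names rain and same are illustrative, and this checks only the results, not the timings.

```python
import pandas as pd

rain = pd.Series(
    [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
    index=pd.date_range("2020-01-01 00:00", periods=6, freq="5min"),
)

same = {}
for dur in (5, 10, 15):
    freq = f"{dur}min"
    by_grouper = rain.groupby(pd.Grouper(freq=freq)).sum()
    by_resample = rain.resample(freq).sum()
    # Series.equals compares both the bin labels and the summed values.
    same[dur] = by_grouper.equals(by_resample)

print(same)
```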
Even though there is no way to properly vectorize this, you can still get some parallelization and optimization with Dask.
!python3 -m pip install "dask[dataframe]" --upgrade
import dask.dataframe as dd
%%timeit
dd_df = dd.from_pandas(df, npartitions = 1)
def get_max_by_dur(DUR):
    return dd_df.resample(str(DUR)+"Min").sum().max()
l_a = [(get_max_by_dur(dur)/dur*60).compute() for dur in DURS]
2.21 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Usually, you use apply to apply a function along an axis of a DataFrame. It is a substitute for looping through the rows or columns of the DataFrame yourself, but in reality apply is just a glorified loop with some extra functionality. So when performance matters, you usually want to try the options below.
Let's say you want to get the product of two columns.
1). Vectorization or basic methods.
Basic methods:
df["product"] = df.prod(axis=1)
162 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Vectorization:
import numpy as np
def multiply(Value,State): # you may use lambda here as well
    return Value*State
%timeit df["new_column"] = np.vectorize(multiply) (df["Value[mm]"], df["State of value"])
853 ms ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
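Note that the NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, which may explain why it is slower than the basic method. A minimal sketch of plain column arithmetic, which stays inside compiled NumPy code, is shown below on a small made-up frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Value[mm]": [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
    "State of value": [5, 5, 5, 5, 5, 5],
})

# Column arithmetic: one operation over the whole column at once.
vectorized = df["Value[mm]"] * df["State of value"]

# np.vectorize: calls the Python function once per element.
looped = np.vectorize(lambda value, state: value * state)(
    df["Value[mm]"], df["State of value"]
)

print(np.allclose(vectorized, looped))
```

Both give the same numbers; only the first avoids a per-element Python call.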
2). Numba or Cython.
This can be very useful if you have already written a loop. You can often just decorate it with @numba.jit and get a significant performance boost. It's also very helpful when you want to compute an iterative value that is difficult to vectorize.
Since the function we chose is a simple multiplication, you won't see any benefit compared with a plain apply.
%%cython
cdef double cython_multiply(double Value, double State):
    return Value * State
%timeit df["new_column"] = df.apply(lambda row:multiply(row["Value[mm]"], row["State of value"]), axis = 1)
1min 38s ± 4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
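To illustrate the @numba.jit suggestion above, here is a hedged sketch of a loop where each output depends on the previous one, which is the kind of iterative value that is awkward to vectorize. The function decayed_total and its decay parameter are hypothetical, chosen only for illustration, and the fallback decorator is an assumption so that the sketch still runs (uncompiled) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs (uncompiled) without Numba.
        return func


@njit
def decayed_total(values, decay):
    """Running total where the previous total is scaled by `decay` each step."""
    out = np.empty_like(values)
    total = 0.0
    for i in range(values.shape[0]):
        total = decay * total + values[i]
        out[i] = total
    return out


rain = np.array([1.0, 0.5, 4.0, 2.0, 2.0, 0.5])
decayed = decayed_total(rain, 0.5)
print(decayed)
```

The function works on a NumPy array rather than a DataFrame, so you would pass something like a column's .to_numpy() result to it.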
3). List comprehension.
It's Pythonic and also quite similar to a for loop.
%timeit df["new_column"] = [x*y for x, y in zip(df["Value[mm]"], df["State of value"])]
1.56 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4). Apply method
Notice how slow it is.
%timeit df["new_column"] = df.apply(lambda row:row["Value[mm]"]*row["State of value"], axis = 1)
1min 37s ± 4.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
5). Looping through rows
itertuples:
%%timeit
list_a = []
for row in df.itertuples():
    list_a.append(row[2]*row[3])
df['product'] = list_a
9.81 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
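If you do loop with itertuples, one variant worth knowing is to select only the columns you need and ask for plain tuples, so the values can be unpacked by name instead of by position. The sketch below uses a small made-up frame and makes no claim about timings.

```python
import pandas as pd

df = pd.DataFrame({
    "Value[mm]": [1.0, 0.5, 4.0],
    "State of value": [5, 5, 5],
})

# index=False drops the index and name=None yields plain tuples, so the two
# selected columns unpack directly instead of being read as row[2], row[3].
two_columns = df[["Value[mm]", "State of value"]]
products = [
    value * state
    for value, state in two_columns.itertuples(index=False, name=None)
]
df["product"] = products
print(df)
```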
iterrows (you probably shouldn't use that):
%%timeit
list_a = []
for row in df.iterrows():
    list_a.append(row[1][1]*row[1][2])
df['product'] = list_a
6min 40s ± 1min 8s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 3