Reputation: 2740
I have a data frame df
which has dates in it:
df['Survey_Date'].head(4)
Out[65]:
0 1990-09-28
1 1991-07-26
2 1991-11-23
3 1992-10-15
I am interested in calculating a metric between two of the dates, using a separate data frame flow_df
.
flow_df
looks like:
date flow
0 1989-01-01 7480
1 1989-01-02 5070
2 1989-01-03 6410
3 1989-01-04 10900
4 1989-01-05 11700
For instance, I would like to query another data frame based on the current_date
and early_date
. The first time period of interest would be:
current_date = 1991-07-26
early_date = 1990-09-28
I have written a clunky for loop and it gets the job done, but I am sure there is a more elegant way:
My approach with a counter and for loop:
def find_peak(early_date,current_date,flow_df):
mask = (flow_df['date']>= early_date) & (flow_df['date'] < current_date)
query = flow_df.loc[mask]
peak_flow = np.max(query['flow'])*0.3048**3
return peak_flow
n=0
for thing in df['Survey_Date'][1:]:
early_date = df['Survey_Date'][n]
current_date = thing
peak_flow = find_peak(early_date,current_date,flow_df)
n+=1
df['Avg_Stage'][n] = peak_flow
How can I do this without a counter and for loop?
The desired output looks like:
Survey_Date Avg_Stage
0 1990-09-28
1 1991-07-26 574.831986
2 1991-11-23 526.693347
3 1992-10-15 458.732915
4 1993-04-01 855.168767
5 1993-11-17 470.059653
6 1994-04-07 419.089330
7 1994-10-21 450.237861
8 1995-04-24 498.376500
9 1995-06-23 506.871554
Upvotes: 0
Views: 200
Reputation: 8464
You can use zip()
:
for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:]):
#do whatever yo want.
Of course you can put it into a list comprehension:
[some_metric(early_date, current_date) for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:])]
Upvotes: 0
Reputation: 6383
You can define a new variable that identifies survey period and use pandas.DataFrame.groupby
to avoid for
loop. It should be much faster when flow_df
is large.
#convert both to datetime, if they are not
df['Survey_Date'] = pd.to_datetime(df['Survey_Date'])
flow_df['date'] = pd.to_datetime(flow_df['date'])
#Merge Survey_Date to flow_df. Most rows of flow_df['Survey_Date'] should be NaT
flow_df = flow_df.merge(df, left_on='date', right_on='Survey_Date', how='outer')
# In case not all Survey_Date in flow_df['date'] or data not sorted by date.
flow_df['date'].fillna(flow_df['Survey_Date'], inplace=True)
flow_df.sort_values('date', inplace=True)
#Identify survey period. In your example: [1990-09-28, 1991-07-26) is represented by 0; [1991-07-26, 1991-11-23) = 1; etc.
flow_df['survey_period'] = flow_df['Survey_Date'].notnull().cumsum()
#calc Avg_Stage in each survey_period. I did .shift(1) because you want to align period [1990-09-28, 1991-07-26) to 1991-07-26
df['Avg_Stage'] = (flow_df.groupby('survey_period')['flow'].max()*0.3048**3).shift(1)
Upvotes: 2