dubbbdan
dubbbdan

Reputation: 2740

Pandas: Loop through dataframe with out counter

I have a data frame df which has dates in it:

df['Survey_Date'].head(4)
Out[65]: 
0    1990-09-28
1    1991-07-26
2    1991-11-23
3    1992-10-15

I am interested in calculating a metric between two of the dates, using a separate data frame flow_df.

flow_df looks like:

        date   flow
0 1989-01-01   7480
1 1989-01-02   5070
2 1989-01-03   6410
3 1989-01-04  10900
4 1989-01-05  11700

For instance, I would like to query another data frame based on the current_date and early_date. The first time period of interest would be:

current_date = 1991-07-26
early_date = 1990-09-28

I have written a clunky for loop and it gets the job done, but I am sure there is a more elegant way:

My approach with a counter and for loop:

def find_peak(early_date,current_date,flow_df):
    mask = (flow_df['date']>= early_date) & (flow_df['date'] < current_date)
    query = flow_df.loc[mask]
    peak_flow = np.max(query['flow'])*0.3048**3
    return peak_flow

n=0
for thing in df['Survey_Date'][1:]:
    early_date = df['Survey_Date'][n]
    current_date = thing
    peak_flow = find_peak(early_date,current_date,flow_df)
    n+=1
    df['Avg_Stage'][n] = peak_flow

How can I do this without a counter and for loop?

The desired output looks like:

  Survey_Date   Avg_Stage
0  1990-09-28  
1  1991-07-26  574.831986
2  1991-11-23  526.693347
3  1992-10-15  458.732915
4  1993-04-01  855.168767
5  1993-11-17  470.059653
6  1994-04-07  419.089330
7  1994-10-21  450.237861
8  1995-04-24  498.376500
9  1995-06-23  506.871554

Upvotes: 0

Views: 200

Answers (2)

Ohad Eytan
Ohad Eytan

Reputation: 8464

You can use zip():

for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:]):
    #do whatever yo want.

Of course you can put it into a list comprehension:

[some_metric(early_date, current_date) for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:])]

Upvotes: 0

Happy001
Happy001

Reputation: 6383

You can define a new variable that identifies survey period and use pandas.DataFrame.groupby to avoid for loop. It should be much faster when flow_df is large.

#convert both to datetime, if they are not
df['Survey_Date'] = pd.to_datetime(df['Survey_Date'])
flow_df['date'] = pd.to_datetime(flow_df['date'])

#Merge Survey_Date to flow_df. Most rows of flow_df['Survey_Date'] should be NaT
flow_df = flow_df.merge(df, left_on='date', right_on='Survey_Date', how='outer')

# In case not all Survey_Date in flow_df['date'] or data not sorted by date.
flow_df['date'].fillna(flow_df['Survey_Date'], inplace=True)
flow_df.sort_values('date', inplace=True)

#Identify survey period. In your example: [1990-09-28, 1991-07-26) is represented by 0; [1991-07-26, 1991-11-23) = 1; etc.
flow_df['survey_period'] = flow_df['Survey_Date'].notnull().cumsum()

#calc Avg_Stage in each survey_period. I did .shift(1) because you want to align period [1990-09-28, 1991-07-26) to 1991-07-26 
df['Avg_Stage'] = (flow_df.groupby('survey_period')['flow'].max()*0.3048**3).shift(1)

Upvotes: 2

Related Questions