Reputation: 149
I have a pandas dataframe which is the output of a sql query which returns hourly values if the values do not meet a particular Threshold.
date_date | hour24 | column ------------------------------------ 2017-10-29 | 00:00 | 5.8055152395 2017-10-29 | 01:00 | 1.2578616352 2017-10-29 | 02:00 | -1.5197568389 2017-10-29 | 03:00 | -12.5560538117 2017-10-29 | 04:00 | -15.6862745098 2017-10-29 | 05:00 | -18.487394958 2017-10-29 | 06:00 | -13.2911392405 2017-10-29 | 07:00 | -9.3385214008 2017-10-29 | 08:00 | -15.3846153846 2017-10-28 | 00:00 | 6.9666182874 2017-10-28 | 01:00 | 8.3857442348 2017-10-28 | 02:00 | 8.8145896657 2017-10-28 | 03:00 | 4.0358744395 2017-10-28 | 04:00 | 13.0718954248 2017-10-28 | 05:00 | 0 2017-10-28 | 06:00 | 13.9240506329 2017-10-28 | 07:00 | 24.513618677
I use this output to create a report. So for each hour if the query returns a value I want it to be marked as Failed, but I also want the hours for which the values didn't cross the threshold to be marked as Passed. For e.g.
date_date | hour24 | Result ------------------------------ 2017-10-29 | 00:00 | Failed 2017-10-29 | 01:00 | Failed 2017-10-29 | 02:00 | Failed 2017-10-29 | 03:00 | Failed 2017-10-29 | 04:00 | Failed 2017-10-29 | 05:00 | Failed 2017-10-29 | 06:00 | Failed 2017-10-29 | 07:00 | Failed 2017-10-29 | 08:00 | Failed 2017-10-29 | 09:00 | Passed 2017-10-29 | 10:00 | Passed 2017-10-29 | 11:00 | Passed 2017-10-29 | 12:00 | Passed 2017-10-29 | 13:00 | Passed 2017-10-29 | 14:00 | Passed 2017-10-29 | 15:00 | Passed 2017-10-29 | 16:00 | Passed 2017-10-29 | 17:00 | Passed 2017-10-29 | 18:00 | Passed 2017-10-29 | 19:00 | Passed 2017-10-29 | 20:00 | Passed 2017-10-29 | 21:00 | Passed 2017-10-29 | 22:00 | Passed 2017-10-29 | 23:00 | Passed 2017-10-28 | 00:00 | Failed 2017-10-28 | 01:00 | Failed . . .
Upvotes: 0
Views: 1526
Reputation: 149
This is how it was done using eventually, using the inputs provided here:
A dataframe was created containing only the time series values:
In [1]: df_time = pd.DataFrame(pd.date_range(start='20171029 00', end='20171030 00', freq='1H'), columns=['date_date'])
In [2]:df_time.head()
Out[2]:
date_date
0 2017-10-29 00:00:00
1 2017-10-29 01:00:00
2 2017-10-29 02:00:00
3 2017-10-29 03:00:00
4 2017-10-29 04:00:00
The original dataframe:
In[3]:df.head()
out[3]:
date_date hour24 ?column? 0 2017-10-29 00:00 5.805515 1 2017-10-29 01:00 1.257862 2 2017-10-29 02:00 -1.519757 3 2017-10-29 03:00 -12.556054 4 2017-10-29 04:00 -15.686275
Combine the date and hours in the first column, convert it to np.datetime64
In[4]:df['date_date'] = (df.date_date.map(str) + ' ' + df.hour24.map(str)).astype(np.datetime64)
Merge the dataframes df and df_time
In[5]: df = pd.merge(df, df_time, on=['date_date'], how='outer')
Set the status using np.where, where the condition is met.
In[6]: df['Status'] = np.where(pd.isnull(df.iloc[:,2]), 'Pass', 'Fail')
Drop the unnecessary columns.
In[7]: df.drop(['?column?', 'hour24'], inplace=True, axis=1)
Sort the values.
In[8]: df.sort_values(by=['date_date'], inplace=True)
Which finally gives the desired output.
In[9]: df.head(24)
Out[9]:
date_date Status 9 2017-10-28 00:00:00 Fail 10 2017-10-28 01:00:00 Fail 11 2017-10-28 02:00:00 Fail 12 2017-10-28 03:00:00 Fail 13 2017-10-28 04:00:00 Fail 14 2017-10-28 05:00:00 Fail 15 2017-10-28 06:00:00 Fail 16 2017-10-28 07:00:00 Fail 17 2017-10-28 08:00:00 Pass 18 2017-10-28 09:00:00 Pass 19 2017-10-28 10:00:00 Pass 20 2017-10-28 11:00:00 Pass 21 2017-10-28 12:00:00 Pass 22 2017-10-28 13:00:00 Pass 23 2017-10-28 14:00:00 Pass 24 2017-10-28 15:00:00 Pass 25 2017-10-28 16:00:00 Pass 26 2017-10-28 17:00:00 Pass 27 2017-10-28 18:00:00 Pass 28 2017-10-28 19:00:00 Pass 29 2017-10-28 20:00:00 Pass 30 2017-10-28 21:00:00 Pass 31 2017-10-28 22:00:00 Pass 32 2017-10-28 23:00:00 Pass
Upvotes: 4
Reputation: 192
You can create a sample dataframe with columns required for reporting like
In [1]: reporting_df.columns
Out[1]: Index(['date_date', 'Hour'], dtype='object')`
And merge the reporting_df with the dataframe from your SQL query output on data_date column
In [2]: out_df = pd.merge(left=reporting_df, right=query_df, on='date_date', how='inner')
Out[2]: out_df.head(3)
date_date hour24 column
2017-10-29 00:00 5.8055152395
2017-10-29 01:00 1.2578616352
2017-10-29 02:00 -1.5197568389
2017-10-29 03:00 -12.5560538117
2017-10-29 04:00 NaN
and use np.where for getting status
In [3]: out_df['Status'] = np.where(pd.isnull(out_df['column']), 'Success', 'Fail')
and drop the unrequired columns
In [4]: out_df.drop('column', axis=1, inplace=True)
In [5]: out_df.head(3)
Out[5]:
date_date hour24 status
2017-10-29 00:00 Fail
2017-10-29 01:00 Fail
2017-10-29 02:00 Fail
2017-10-29 03:00 Fail
2017-10-29 04:00 Pass
Upvotes: 3
Reputation: 1725
You could apply a function to your dataframe which checks if each row matches a row from another dataframe.
This solution assumes a few things:
1. Dates and times are unique
2. You have, or can create a dataframe (df_all_dates) which contains all the dates and times you want to test.
# some setup for this example
import pandas as pd
x, y = [], []
for i in range(5):
x.append("{:02d}:00".format(i))
y.append("2017-10-29")
df_all_dates = pd.DataFrame(data=list(zip(x,y)), columns=["date", "time"])
df_all_dates
output:
date time
0 00:00 2017-10-29
1 01:00 2017-10-29
2 02:00 2017-10-29
3 03:00 2017-10-29
4 04:00 2017-10-29
# as an example, assume the first two rows were returned by your sql query
df_sql_query = df_all_dates.head(2)
df_sql_query
output:
date time
0 00:00 2017-10-29
1 01:00 2017-10-29
the important bit:
# check if each row in df_all_dates has the same date and time
# as any row from df_sql_query
def passed_threshold(data):
if ((df_sql_query['date'] == data["date"]) & (df_sql_query['time'] == data["time"])).any():
return "Failed"
else:
return "Passed"
df_all_dates["Result"] = df_all_dates.apply(passed_threshold, axis=1)
df_all_dates
output:
date time Result
0 00:00 2017-10-29 Failed
1 01:00 2017-10-29 Failed
2 02:00 2017-10-29 Passed
3 03:00 2017-10-29 Passed
4 04:00 2017-10-29 Passed
Upvotes: 1