theSanjeev
theSanjeev

Reputation: 149

How to fill the missing hour Values in a pandas DataFrame

I have a pandas dataframe which is the output of a sql query which returns hourly values if the values do not meet a particular Threshold.

 date_date  | hour24 | column
------------------------------------
 2017-10-29 | 00:00  | 5.8055152395
 2017-10-29 | 01:00  | 1.2578616352
 2017-10-29 | 02:00  | -1.5197568389
 2017-10-29 | 03:00  | -12.5560538117
 2017-10-29 | 04:00  | -15.6862745098
 2017-10-29 | 05:00  | -18.487394958
 2017-10-29 | 06:00  | -13.2911392405
 2017-10-29 | 07:00  | -9.3385214008
 2017-10-29 | 08:00  | -15.3846153846
 2017-10-28 | 00:00  | 6.9666182874
 2017-10-28 | 01:00  | 8.3857442348
 2017-10-28 | 02:00  | 8.8145896657
 2017-10-28 | 03:00  | 4.0358744395
 2017-10-28 | 04:00  | 13.0718954248
 2017-10-28 | 05:00  | 0
 2017-10-28 | 06:00  | 13.9240506329
 2017-10-28 | 07:00  | 24.513618677

I use this output to create a report. So for each hour if the query returns a value I want it to be marked as Failed, but I also want the hours for which the values didn't cross the threshold to be marked as Passed. For e.g.

 date_date  | hour24 | Result
------------------------------
 2017-10-29 | 00:00  | Failed
 2017-10-29 | 01:00  | Failed
 2017-10-29 | 02:00  | Failed
 2017-10-29 | 03:00  | Failed
 2017-10-29 | 04:00  | Failed
 2017-10-29 | 05:00  | Failed
 2017-10-29 | 06:00  | Failed
 2017-10-29 | 07:00  | Failed
 2017-10-29 | 08:00  | Failed
 2017-10-29 | 09:00  | Passed
 2017-10-29 | 10:00  | Passed
 2017-10-29 | 11:00  | Passed
 2017-10-29 | 12:00  | Passed
 2017-10-29 | 13:00  | Passed
 2017-10-29 | 14:00  | Passed
 2017-10-29 | 15:00  | Passed
 2017-10-29 | 16:00  | Passed
 2017-10-29 | 17:00  | Passed
 2017-10-29 | 18:00  | Passed
 2017-10-29 | 19:00  | Passed
 2017-10-29 | 20:00  | Passed
 2017-10-29 | 21:00  | Passed
 2017-10-29 | 22:00  | Passed
 2017-10-29 | 23:00  | Passed
 2017-10-28 | 00:00  | Failed
 2017-10-28 | 01:00  | Failed
.
.
.

Upvotes: 0

Views: 1526

Answers (3)

theSanjeev
theSanjeev

Reputation: 149

This is how it was done using eventually, using the inputs provided here:

A dataframe was created containing only the time series values:

In [1]: df_time = pd.DataFrame(pd.date_range(start='20171029 00', end='20171030 00', freq='1H'), columns=['date_date'])
In [2]:df_time.head()
Out[2]: 
    date_date
 0 2017-10-29 00:00:00
 1 2017-10-29 01:00:00
 2 2017-10-29 02:00:00
 3 2017-10-29 03:00:00
 4 2017-10-29 04:00:00

The original dataframe:

In[3]:df.head()
out[3]: 
    date_date  hour24   ?column?
0  2017-10-29  00:00   5.805515
1  2017-10-29  01:00   1.257862
2  2017-10-29  02:00  -1.519757
3  2017-10-29  03:00 -12.556054
4  2017-10-29  04:00 -15.686275

Combine the date and hours in the first column, convert it to np.datetime64

In[4]:df['date_date'] = (df.date_date.map(str) + ' ' + df.hour24.map(str)).astype(np.datetime64)

Merge the dataframes df and df_time

In[5]: df = pd.merge(df, df_time, on=['date_date'], how='outer')

Set the status using np.where, where the condition is met.

In[6]: df['Status'] = np.where(pd.isnull(df.iloc[:,2]), 'Pass', 'Fail')

Drop the unnecessary columns.

In[7]: df.drop(['?column?', 'hour24'], inplace=True, axis=1) 

Sort the values.

In[8]: df.sort_values(by=['date_date'], inplace=True)

Which finally gives the desired output.

In[9]: df.head(24)
Out[9]: 
             date_date Status
9  2017-10-28 00:00:00   Fail
10 2017-10-28 01:00:00   Fail
11 2017-10-28 02:00:00   Fail
12 2017-10-28 03:00:00   Fail
13 2017-10-28 04:00:00   Fail
14 2017-10-28 05:00:00   Fail
15 2017-10-28 06:00:00   Fail
16 2017-10-28 07:00:00   Fail
17 2017-10-28 08:00:00   Pass
18 2017-10-28 09:00:00   Pass
19 2017-10-28 10:00:00   Pass
20 2017-10-28 11:00:00   Pass
21 2017-10-28 12:00:00   Pass
22 2017-10-28 13:00:00   Pass
23 2017-10-28 14:00:00   Pass
24 2017-10-28 15:00:00   Pass
25 2017-10-28 16:00:00   Pass
26 2017-10-28 17:00:00   Pass
27 2017-10-28 18:00:00   Pass
28 2017-10-28 19:00:00   Pass
29 2017-10-28 20:00:00   Pass
30 2017-10-28 21:00:00   Pass
31 2017-10-28 22:00:00   Pass
32 2017-10-28 23:00:00   Pass

Upvotes: 4

privatevoid
privatevoid

Reputation: 192

You can create a sample dataframe with columns required for reporting like

In [1]: reporting_df.columns
Out[1]: Index(['date_date', 'Hour'], dtype='object')`

And merge the reporting_df with the dataframe from your SQL query output on data_date column

In [2]: out_df = pd.merge(left=reporting_df, right=query_df, on='date_date', how='inner')
Out[2]: out_df.head(3)
date_date hour24 column
2017-10-29 00:00 5.8055152395
2017-10-29 01:00 1.2578616352
2017-10-29 02:00 -1.5197568389
2017-10-29 03:00 -12.5560538117
2017-10-29 04:00 NaN

and use np.where for getting status

In [3]: out_df['Status'] = np.where(pd.isnull(out_df['column']), 'Success', 'Fail')

and drop the unrequired columns

In [4]: out_df.drop('column', axis=1, inplace=True)
In [5]: out_df.head(3)
Out[5]: 
date_date hour24 status
2017-10-29 00:00 Fail
2017-10-29 01:00 Fail
2017-10-29 02:00 Fail
2017-10-29 03:00 Fail
2017-10-29 04:00 Pass

Upvotes: 3

ConorSheehan1
ConorSheehan1

Reputation: 1725

You could apply a function to your dataframe which checks if each row matches a row from another dataframe.

This solution assumes a few things:
1. Dates and times are unique
2. You have, or can create a dataframe (df_all_dates) which contains all the dates and times you want to test.

# some setup for this example
import pandas as pd
x, y = [], []
for i in range(5):
    x.append("{:02d}:00".format(i))
    y.append("2017-10-29")

df_all_dates = pd.DataFrame(data=list(zip(x,y)), columns=["date", "time"])
df_all_dates

output:

    date    time
0   00:00   2017-10-29
1   01:00   2017-10-29
2   02:00   2017-10-29
3   03:00   2017-10-29
4   04:00   2017-10-29
# as an example, assume the first two rows were returned by your sql query
df_sql_query = df_all_dates.head(2)
df_sql_query

output:

    date    time
0   00:00   2017-10-29
1   01:00   2017-10-29

the important bit:

# check if each row in df_all_dates has the same date and time
# as any row from df_sql_query
def passed_threshold(data):
    if ((df_sql_query['date'] == data["date"]) & (df_sql_query['time'] == data["time"])).any():
        return "Failed"
    else:
        return "Passed"

df_all_dates["Result"] = df_all_dates.apply(passed_threshold, axis=1)
df_all_dates

output:

    date    time    Result
0   00:00   2017-10-29  Failed
1   01:00   2017-10-29  Failed
2   02:00   2017-10-29  Passed
3   03:00   2017-10-29  Passed
4   04:00   2017-10-29  Passed

Upvotes: 1

Related Questions