cslurker31

Reputation: 95

Python add missing rows to dataframe

I have a dataframe that can sometimes have incomplete data. For example, the one below stops at Hour 22 instead of 23:

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9

How do I use pandas so that it looks like this?

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9
188  2020-05-20    23         0          1  413379290                 NaN
189  2020-05-20    23        15          1  413379290                 NaN
190  2020-05-20    23        30          1  413379290                 NaN
191  2020-05-20    23        45          1  413379290                 NaN

Upvotes: 3

Views: 384

Answers (3)

Joe Ferndz

Reputation: 8508

The approach I would take is to find the minimum and maximum timestamps, then create a range of timestamps at 15-minute intervals. A left merge (df1.merge) then brings the values from df into the newly created dataframe.

Note that the sample data here starts at 2020-05-19 01:00:00 rather than 00:00:00, so the final output also starts at 01:00:00.

import pandas as pd
c = ['Date','Hour','Interval','Source','ID','Number of Messages']
d = [
['2020-05-19',     1,         0,          1,  413379290,                  23],
['2020-05-19',     1,        15,          1,  413379290,                  36],
['2020-05-19',     1,        30,          1,  413379290,                  31],
['2020-05-19',     1,        45,          1,  413379290,                  14],
['2020-05-19',     2,         0,          1,  413379290,                   3],
['2020-05-20',    21,        45,          1,  413379290,                   6],
['2020-05-20',    22,         0,          1,  413379290,                   8],
['2020-05-20',    22,        15,          1,  413379290,                   4],
['2020-05-20',    22,        30,          1,  413379290,                   6],
['2020-05-20',    22,        45,          1,  413379290,                   9]]

df = pd.DataFrame(d,columns=c)
df['Date'] = pd.to_datetime(df['Date'])

print (df)

#first get the start and end period by adding Hour and Interval to Date
df['DateFull'] = df.Date + pd.to_timedelta(df.Hour,unit='h') + pd.to_timedelta(df.Interval,unit='m')

#Create a range of dates with 15 mins interval from Start Date (including Hour & Min) to Last Day + 23:45
df1 = pd.DataFrame({'DateFull':pd.date_range(df.DateFull.min(), df.DateFull.max().floor('D') + pd.to_timedelta('23:45:00'), freq='15min')})

#Create columns with Hour and Interval based on new Date Range
df1['Hour'] = df1.DateFull.dt.hour
df1['Interval'] = df1.DateFull.dt.minute

#Merge on DateFull, Hour, Interval to get the full set merged with original DF
df1 = df1.merge(df, how='left', on=['DateFull','Hour','Interval'])

#forward fill Date, Source and ID 
df1[['Date','Source','ID']] = df1[['Date','Source','ID']].ffill()

#convert Source and ID to int
df1[['Source','ID']] = df1[['Source','ID']].astype(int)

#Drop DateFull as it is no longer needed
df1.drop(columns ='DateFull',inplace=True)

#Restore the original column order
df1 = df1.reindex(c, axis=1)

print (df1)

Original DataFrame:

        Date  Hour  Interval  Source         ID  Number of Messages
0 2020-05-19     1         0       1  413379290                  23
1 2020-05-19     1        15       1  413379290                  36
2 2020-05-19     1        30       1  413379290                  31
3 2020-05-19     1        45       1  413379290                  14
4 2020-05-19     2         0       1  413379290                   3
5 2020-05-20    21        45       1  413379290                   6
6 2020-05-20    22         0       1  413379290                   8
7 2020-05-20    22        15       1  413379290                   4
8 2020-05-20    22        30       1  413379290                   6
9 2020-05-20    22        45       1  413379290                   9

Final DataFrame:

          Date  Hour  Interval  Source         ID  Number of Messages
0   2020-05-19     1         0       1  413379290                23.0
1   2020-05-19     1        15       1  413379290                36.0
2   2020-05-19     1        30       1  413379290                31.0
3   2020-05-19     1        45       1  413379290                14.0
4   2020-05-19     2         0       1  413379290                 3.0
..         ...   ...       ...     ...        ...                 ...
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN

df1.tail(20) gives you this:

          Date  Hour  Interval  Source         ID  Number of Messages
168 2020-05-19    19         0       1  413379290                 NaN
169 2020-05-19    19        15       1  413379290                 NaN
170 2020-05-19    19        30       1  413379290                 NaN
171 2020-05-19    19        45       1  413379290                 NaN
172 2020-05-19    20         0       1  413379290                 NaN
173 2020-05-19    20        15       1  413379290                 NaN
174 2020-05-19    20        30       1  413379290                 NaN
175 2020-05-19    20        45       1  413379290                 NaN
176 2020-05-19    21         0       1  413379290                 NaN
177 2020-05-19    21        15       1  413379290                 NaN
178 2020-05-19    21        30       1  413379290                 NaN
179 2020-05-20    21        45       1  413379290                 6.0
180 2020-05-20    22         0       1  413379290                 8.0
181 2020-05-20    22        15       1  413379290                 4.0
182 2020-05-20    22        30       1  413379290                 6.0
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN
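
One thing to watch in the tail above: rows 168 to 178 are labelled 2020-05-19, although counting 15-minute steps from 2020-05-19 01:00 puts them on 2020-05-20 (19:00 to 21:30). Forward-filling Date copies the last date seen into the gap. Below is a minimal sketch of one way around that, assuming it is acceptable to derive Date from the generated timestamp instead of forward-filling it. The two-row sample frame and the names grid, end and out are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one row on the 19th, one late row on the 20th.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-05-19', '2020-05-20']),
    'Hour': [1, 22],
    'Interval': [0, 45],
    'Source': [1, 1],
    'ID': [413379290, 413379290],
    'Number of Messages': [23, 9],
})

# Combine Date, Hour and Interval into one timestamp per row.
df['DateFull'] = (df['Date']
                  + pd.to_timedelta(df['Hour'], unit='h')
                  + pd.to_timedelta(df['Interval'], unit='m'))

# Full 15-minute grid from the first timestamp up to 23:45 on the last day.
end = df['DateFull'].max().floor('D') + pd.Timedelta(hours=23, minutes=45)
grid = pd.DataFrame(
    {'DateFull': pd.date_range(df['DateFull'].min(), end, freq='15min')}
)

# Derive Date, Hour and Interval from the timestamp itself, so gaps
# that cross midnight still get the right calendar date.
grid['Date'] = grid['DateFull'].dt.normalize()
grid['Hour'] = grid['DateFull'].dt.hour
grid['Interval'] = grid['DateFull'].dt.minute

# Left merge keeps every grid row; missing rows get NaN message counts.
out = grid.merge(df.drop(columns=['Date', 'Hour', 'Interval']),
                 how='left', on='DateFull')

# Source and ID are constant in this sample, so forward-filling them is safe here.
out[['Source', 'ID']] = out[['Source', 'ID']].ffill().astype(int)
out = out.drop(columns='DateFull')
```

With this change only Source and ID are forward-filled, which is still only safe when they are constant across the frame.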

Upvotes: 2

norie

Reputation: 9857

You can do this by creating a new dataframe with all the expected values for the first five columns, then merging it with the original dataframe to pull the values from the Number of Messages column into the matching rows.

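import pandas as pd

df = pd.read_csv('test.csv')

# every date present in the data
dates = df['Date'].unique()

# each hour 0-23 repeated four times (once per interval), for every date
hrs = [hr for hr in range(24) for i in range(4)]*len(dates)

# the four 15-minute intervals repeated for every hour of every date
intervals = [0, 15, 30, 45] * 24 *len(dates)

# each date repeated 96 times (24 hours x 4 intervals)
new_df = pd.DataFrame()
new_df['Date'] = [dt for dt in dates for i in range(24*4)]

new_df['Hour'] = hrs
new_df['Interval'] = intervals
# Source and ID come from the first row, so this assumes a single Source and ID
new_df['Source'] = df['Source'].iloc[0]
new_df['ID'] = str(df['ID'].iloc[0])

# left merge keeps every generated row; rows missing from df get NaN
new_df = new_df.merge(df, how='left', on=['Date', 'Hour', 'Interval']).drop(['Source_y', 'ID_y'], axis=1)
new_df.rename(columns={'Source_x':'Source', 'ID_x':'ID'},inplace=True)

new_df.to_excel('testit.xlsx')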
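
To try the same idea without test.csv, here is a small self-contained sketch. The inline sample data (one date that stops at hour 22, with a made-up count of 5 per row) and the name counts are assumptions for illustration. It also merges in only the message counts, which is a small variation that avoids the _x/_y suffixes and keeps ID numeric.

```python
import pandas as pd

# Hypothetical stand-in for test.csv: one date that stops at hour 22.
cols = ['Date', 'Hour', 'Interval', 'Source', 'ID', 'Number of Messages']
rows = [('2020-05-20', h, m, 1, 413379290, 5)
        for h in range(23) for m in (0, 15, 30, 45)]
df = pd.DataFrame(rows, columns=cols)

dates = df['Date'].unique()

# One row per date, hour and quarter-hour: 96 rows per date.
new_df = pd.DataFrame({
    'Date': [d for d in dates for _ in range(24 * 4)],
    'Hour': [h for h in range(24) for _ in range(4)] * len(dates),
    'Interval': [0, 15, 30, 45] * 24 * len(dates),
})

# Assumes a single Source and ID across the whole frame.
new_df['Source'] = df['Source'].iloc[0]
new_df['ID'] = df['ID'].iloc[0]

# Bring in only the message counts, so no _x/_y suffixes appear.
counts = df[['Date', 'Hour', 'Interval', 'Number of Messages']]
new_df = new_df.merge(counts, how='left', on=['Date', 'Hour', 'Interval'])
```

Unlike the date-range approach, this always produces hours 0 to 23 for every date in the data, even when the first date starts late.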

Upvotes: 1

Ben.T

Reputation: 29635

You can use reindex. To create the missing hours, take the existing values in every column except Number of Messages, build a MultiIndex from them with MultiIndex.from_frame, then use MultiIndex.from_product over its levels to generate every combination, replacing the existing Hour values with range(24). Finally, call set_index on the dataframe and reindex it with the full set of values.

import pandas as pd

# all columns except the one you want NaN in
cols = ['Date','Hour', 'Interval', 'Source','ID']
#create the multiindex with all values
new_idx = (
    pd.MultiIndex.from_product(
        [lv if col != 'Hour' else range(24) #replace existing values by range 0 to 23
         for col, lv in zip(cols, pd.MultiIndex.from_frame(df[cols]).levels)
         ], names=cols)
)
#reindex the original df, you can reassign to same df if you want
new_df = (
    df.set_index(cols)
      .reindex(new_idx)
      .reset_index()
)
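
As a quick way to see what the product produces, here is a self-contained run of the same steps on a made-up frame (one date, hours 0 to 22, a count of 5 per row). The sample data and the name levels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: one date, hours 0 to 22 only.
cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']
rows = [('2020-05-20', h, m, 1, 413379290, 5)
        for h in range(23) for m in (0, 15, 30, 45)]
df = pd.DataFrame(rows, columns=cols + ['Number of Messages'])

# Sorted unique values of each key column.
levels = pd.MultiIndex.from_frame(df[cols]).levels

# Every combination of those values, with Hour forced to 0..23.
new_idx = pd.MultiIndex.from_product(
    [range(24) if col == 'Hour' else lv for col, lv in zip(cols, levels)],
    names=cols,
)

# Rows absent from df appear with NaN in Number of Messages.
new_df = df.set_index(cols).reindex(new_idx).reset_index()

# 1 date x 24 hours x 4 intervals x 1 Source x 1 ID = 96 rows.
print(len(new_df))
```

Because the index is a full Cartesian product, a frame with several distinct Source or ID values would also get combinations that never occurred together, so this fits best when those columns are constant.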

Upvotes: 2
