Reputation: 95
I have a dataframe which can sometimes have incomplete data. For example, the one below stops at Hour 22 instead of 23:
Date Hour Interval Source ID Number of Messages
0 2020-05-19 0 0 1 413379290 23
1 2020-05-19 0 15 1 413379290 36
2 2020-05-19 0 30 1 413379290 31
3 2020-05-19 0 45 1 413379290 14
4 2020-05-19 1 0 1 413379290 3
.. ... ... ... ... ... ...
183 2020-05-20 21 45 1 413379290 6
184 2020-05-20 22 0 1 413379290 8
185 2020-05-20 22 15 1 413379290 4
186 2020-05-20 22 30 1 413379290 6
187 2020-05-20 22 45 1 413379290 9
How do I use pandas so that it looks like this?
Date Hour Interval Source ID Number of Messages
0 2020-05-19 0 0 1 413379290 23
1 2020-05-19 0 15 1 413379290 36
2 2020-05-19 0 30 1 413379290 31
3 2020-05-19 0 45 1 413379290 14
4 2020-05-19 1 0 1 413379290 3
.. ... ... ... ... ... ...
183 2020-05-20 21 45 1 413379290 6
184 2020-05-20 22 0 1 413379290 8
185 2020-05-20 22 15 1 413379290 4
186 2020-05-20 22 30 1 413379290 6
187 2020-05-20 22 45 1 413379290 9
188 2020-05-20 23 0 1 413379290 NaN
189 2020-05-20 23 15 1 413379290 NaN
190 2020-05-20 23 30 1 413379290 NaN
191 2020-05-20 23 45 1 413379290 NaN
Upvotes: 3
Views: 384
Reputation: 8508
The approach I would take is to find the minimum and maximum timestamps, build a range of timestamps at 15-minute intervals between them, and then use df.merge with how='left' to attach the values from df to the newly created dataframe. Slots that are missing from df end up with NaN.
Note that the sample data below starts at 2020-05-19 01:00:00, not 00:00:00, so the final output also starts at 01:00:00.
import pandas as pd
c = ['Date','Hour','Interval','Source','ID','Number of Messages']
d = [
['2020-05-19', 1, 0, 1, 413379290, 23],
['2020-05-19', 1, 15, 1, 413379290, 36],
['2020-05-19', 1, 30, 1, 413379290, 31],
['2020-05-19', 1, 45, 1, 413379290, 14],
['2020-05-19', 2, 0, 1, 413379290, 3],
['2020-05-20', 21, 45, 1, 413379290, 6],
['2020-05-20', 22, 0, 1, 413379290, 8],
['2020-05-20', 22, 15, 1, 413379290, 4],
['2020-05-20', 22, 30, 1, 413379290, 6],
['2020-05-20', 22, 45, 1, 413379290, 9]]
df = pd.DataFrame(d,columns=c)
df['Date'] = pd.to_datetime(df['Date'])
print (df)
#first get the start and end period by adding Hour and Interval to Date
df['DateFull'] = df.Date + pd.to_timedelta(df.Hour,unit='h') + pd.to_timedelta(df.Interval,unit='m')
#Create a range of dates with 15 mins interval from Start Date (including Hour & Min) to Last Day + 23:45
df1 = pd.DataFrame({'DateFull':pd.date_range(df.DateFull.min(), df.DateFull.max().floor('D') + pd.to_timedelta('23:45:00'), freq='15min')})
#Create columns with Hour and Interval based on new Date Range
df1['Hour'] = df1.DateFull.dt.hour
df1['Interval'] = df1.DateFull.dt.minute
#Merge on DateFull, Hour, Interval to get the full set merged with original DF
df1 = df1.merge(df, how='left', on=['DateFull','Hour','Interval'])
#forward fill Date, Source and ID
df1[['Date','Source','ID']] = df1[['Date','Source','ID']].ffill()
#convert Source and ID to int
df1[['Source','ID']] = df1[['Source','ID']].astype(int)
#Drop DateFull as it is no longer needed
df1.drop(columns ='DateFull',inplace=True)
#Restore the original column order
df1 = df1.reindex(c, axis=1)
print (df1)
Original DataFrame:
Date Hour Interval Source ID Number of Messages
0 2020-05-19 1 0 1 413379290 23
1 2020-05-19 1 15 1 413379290 36
2 2020-05-19 1 30 1 413379290 31
3 2020-05-19 1 45 1 413379290 14
4 2020-05-19 2 0 1 413379290 3
5 2020-05-20 21 45 1 413379290 6
6 2020-05-20 22 0 1 413379290 8
7 2020-05-20 22 15 1 413379290 4
8 2020-05-20 22 30 1 413379290 6
9 2020-05-20 22 45 1 413379290 9
Final DataFrame:
Date Hour Interval Source ID Number of Messages
0 2020-05-19 1 0 1 413379290 23.0
1 2020-05-19 1 15 1 413379290 36.0
2 2020-05-19 1 30 1 413379290 31.0
3 2020-05-19 1 45 1 413379290 14.0
4 2020-05-19 2 0 1 413379290 3.0
.. ... ... ... ... ... ...
183 2020-05-20 22 45 1 413379290 9.0
184 2020-05-20 23 0 1 413379290 NaN
185 2020-05-20 23 15 1 413379290 NaN
186 2020-05-20 23 30 1 413379290 NaN
187 2020-05-20 23 45 1 413379290 NaN
df1.tail(20)
gives you this:
Date Hour Interval Source ID Number of Messages
168 2020-05-19 19 0 1 413379290 NaN
169 2020-05-19 19 15 1 413379290 NaN
170 2020-05-19 19 30 1 413379290 NaN
171 2020-05-19 19 45 1 413379290 NaN
172 2020-05-19 20 0 1 413379290 NaN
173 2020-05-19 20 15 1 413379290 NaN
174 2020-05-19 20 30 1 413379290 NaN
175 2020-05-19 20 45 1 413379290 NaN
176 2020-05-19 21 0 1 413379290 NaN
177 2020-05-19 21 15 1 413379290 NaN
178 2020-05-19 21 30 1 413379290 NaN
179 2020-05-20 21 45 1 413379290 6.0
180 2020-05-20 22 0 1 413379290 8.0
181 2020-05-20 22 15 1 413379290 4.0
182 2020-05-20 22 30 1 413379290 6.0
183 2020-05-20 22 45 1 413379290 9.0
184 2020-05-20 23 0 1 413379290 NaN
185 2020-05-20 23 15 1 413379290 NaN
186 2020-05-20 23 30 1 413379290 NaN
187 2020-05-20 23 45 1 413379290 NaN
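One thing to watch in the tail(20) output above: rows 168 to 178 show Date 2020-05-19 for hours 19 to 21, although those slots fall on 2020-05-20. That happens because ffill carries the last known Date forward across the gap. Below is a minimal sketch of one way around it, assuming whole days are wanted: derive Date from the generated timestamp instead of forward filling it, and start the range at midnight of the first day. The shortened sample data and the names grid and full are made up for illustration.

```python
import pandas as pd

# Hypothetical, shortened sample with the same columns as above
c = ['Date', 'Hour', 'Interval', 'Source', 'ID', 'Number of Messages']
d = [
    ['2020-05-19', 1, 0, 1, 413379290, 23],
    ['2020-05-19', 2, 0, 1, 413379290, 3],
    ['2020-05-20', 21, 45, 1, 413379290, 6],
    ['2020-05-20', 22, 45, 1, 413379290, 9],
]
df = pd.DataFrame(d, columns=c)
df['Date'] = pd.to_datetime(df['Date'])
df['DateFull'] = (df['Date']
                  + pd.to_timedelta(df['Hour'], unit='h')
                  + pd.to_timedelta(df['Interval'], unit='m'))

# 15-minute grid from midnight of the first day to 23:45 of the last day
# (assumption: complete days are wanted, as in the question)
grid = pd.date_range(df['DateFull'].min().floor('D'),
                     df['DateFull'].max().floor('D') + pd.Timedelta(hours=23, minutes=45),
                     freq='15min')
full = pd.DataFrame({'DateFull': grid})

# Derive Date, Hour and Interval from the timestamp itself, so no Date is forward filled
full['Date'] = full['DateFull'].dt.normalize()
full['Hour'] = full['DateFull'].dt.hour
full['Interval'] = full['DateFull'].dt.minute

# Left merge keeps every slot; slots missing from df get NaN
full = full.merge(df[['DateFull', 'Source', 'ID', 'Number of Messages']],
                  how='left', on='DateFull')

# Source and ID are constant in this sample, so filling in both directions is safe here
full[['Source', 'ID']] = full[['Source', 'ID']].ffill().bfill().astype(int)

# Back to the original column order, dropping DateFull
full = full[c]
print(full.tail(20))
```

If Source or ID can vary, the ffill/bfill step is no longer safe, and the grid would need to be built per Source and ID instead.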
Upvotes: 2
Reputation: 9857
You can do this by creating a new dataframe with all the appropriate values for the first five columns and then merging it with the original dataframe, which puts the values from the Number of Messages column in the matching rows.
import pandas as pd
df = pd.read_csv('test.csv')
# One row per date, hour and 15-minute interval
dates = df['Date'].unique()
hrs = [hr for hr in range(24) for i in range(4)]*len(dates)
intervals = [0, 15, 30, 45] * 24 *len(dates)
new_df = pd.DataFrame()
new_df['Date'] = [dt for dt in dates for i in range(24*4)]
new_df['Hour'] = hrs
new_df['Interval'] = intervals
# Source and ID are assumed constant, so take them from the first row
new_df['Source'] = df['Source'].iloc[0]
new_df['ID'] = str(df['ID'].iloc[0])
# Left merge keeps every slot; slots missing from df get NaN in Number of Messages
new_df = new_df.merge(df, how='left', on=['Date', 'Hour', 'Interval']).drop(['Source_y', 'ID_y'], axis=1)
new_df.rename(columns={'Source_x':'Source', 'ID_x':'ID'},inplace=True)
new_df.to_excel('testit.xlsx')
Upvotes: 1
Reputation: 29635
You can use reindex. To create the missing hours, take the existing values in all the columns except Number of Messages and build a MultiIndex.from_frame, then use MultiIndex.from_product to recreate all the combinations, replacing the existing values in Hour with range(24). Then call set_index on the dataframe and reindex with the full set of values.
import pandas as pd
# all columns except the one you want NaN in
cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']
# create the MultiIndex with all values
new_idx = (
    pd.MultiIndex.from_product(
        [lv if col != 'Hour' else range(24)  # replace existing values by range 0 to 23
         for col, lv in zip(cols, pd.MultiIndex.from_frame(df[cols]).levels)
        ], names=cols)
)
# reindex the original df; you can reassign to the same df if you want
new_df = (
    df.set_index(cols)
      .reindex(new_idx)
      .reset_index()
)
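To try the reindex approach without the original file, here is a small self-contained sketch. The sample data is made up (a single date with every quarter hour for hours 0 to 22, so hour 23 is missing), and the message counts are placeholder values.

```python
import pandas as pd

cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']

# Hypothetical sample: one date, hours 0 to 22 only, placeholder message counts
rows = [['2020-05-20', h, m, 1, 413379290, h + m]
        for h in range(23) for m in (0, 15, 30, 45)]
df = pd.DataFrame(rows, columns=cols + ['Number of Messages'])

# Unique values per column, as levels of a MultiIndex built from the key columns
levels = pd.MultiIndex.from_frame(df[cols]).levels

# Cartesian product of the levels, with Hour widened to the full 0..23 range
new_idx = pd.MultiIndex.from_product(
    [range(24) if col == 'Hour' else lv for col, lv in zip(cols, levels)],
    names=cols,
)

# Rows absent from df appear with NaN in Number of Messages
new_df = df.set_index(cols).reindex(new_idx).reset_index()
print(new_df.tail(4))
```

Two things to keep in mind: from_product builds every combination of the levels, so with several Source or ID values it also creates combinations that never occurred in the data, and reindex needs the (Date, Hour, Interval, Source, ID) combinations in df to be unique.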
Upvotes: 2