Frederic
Frederic

Reputation: 281

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times, otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!

I have data formatted like this

df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')

              Datetime Sender Count
0  2017-02-12 20:25:00    Sam     8
1  2017-02-15 16:33:00  Scott    10
2  2017-02-15 16:45:00  Steve     5

I need there to be at least one row for every date, so the expected result would be

              Datetime  Sender Count
0  2017-02-12 20:25:00    Sam      8
1  2017-02-13 00:00:00   None      0
2  2017-02-14 00:00:00   None      0
3  2017-02-15 16:33:00  Scott     10
4  2017-02-15 16:45:00  Steve      5

I have tried to make datetime the index, add the dates and use reindex() like so

df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta))
print(df.reindex(values, fill_value=0))

This makes every row forget about the other columns and the same thing happens for asfreq('D') or resample()

                    ID Sender Count
Datetime
2017-02-12 16:25:00  0    Sam     8
2017-02-13 00:00:00  0      0     0
2017-02-14 00:00:00  0      0     0
2017-02-15 20:25:00  0      0     0
2017-02-15 20:25:00  0      0     0

What would be the appropriate way of going about this?

Upvotes: 2

Views: 1143

Answers (2)

G.G
G.G

Reputation: 765

col1=df.Datetime.dt.date.map(pd.to_datetime)
col2=col1.duplicated()
dd1=df.assign(col1=col1).loc[~col2].resample('D',on='col1').agg('first').reset_index()
pd.concat([dd1,df.loc[col2]]).sql.select('ifnull(Datetime,col1) Datetime,Sender,ifnull(Count,0) Count').df()


              Datetime  Sender Count
0  2017-02-12 20:25:00    Sam      8
1  2017-02-13 00:00:00   None      0
2  2017-02-14 00:00:00   None      0
3  2017-02-15 16:33:00  Scott     10
4  2017-02-15 16:45:00  Steve      5

Upvotes: 0

kosnik
kosnik

Reputation: 2434

I would create a new DataFrame column which contains all the required data and then left join with your data frame.

A working code example is the following

df['Datetime'] = pd.to_datetime(df['Datetime']) # first convert to datetimes
datetimes = df['Datetime'].tolist() # these are existing datetimes - will add here the missing
dates = [x.date() for x in datetimes] # these are converted to dates

min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))

# create new dataframe, merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)

Upvotes: 2

Related Questions