Reputation: 892
I have a dataset with like 100K+ rows, one column on this dataset is a Datetime column, let's name it A
.
My Dataset is sorted by column A.
I want to "Fill gaps" of my Dataset, i.e : if i have these two rows following each others :
0 2019-03-13 08:12:20
1 2019-03-13 08:12:25
I want to make add missing seconds between them, as a result, i'll have this :
0 2019-03-13 08:12:20
1 2019-03-13 08:12:21
2 2019-03-13 08:12:22
3 2019-03-13 08:12:23
4 2019-03-13 08:12:24
5 2019-03-13 08:12:25
I don't want to generate rows between two rows if they have different day, month or year.
So if have these two consecutive rows :
0 2019-03-13 08:12:20
1 2019-03-15 08:12:21
I won't add anything.
I can't also generate rows if the time difference between my two rows is greater than 2 hours.
So if have these two consecutive rows :
0 2019-03-13 08:12:20
1 2019-03-15 11:12:21
I won't add anything.
Here's an example to illustrate what i want :
df=pd.DataFrame({'A': ["2019-03-13 08:12:20", "2019-03-13 08:12:25", "2019-03-20 08:17:23", "2019-03-22 08:17:25", "2019-03-22 11:12:20", "2019-03-22 11:12:23", "2019-03-24 12:33:23"]})
A
0 2019-03-13 08:12:20
1 2019-03-13 08:12:25
2 2019-03-20 08:17:23
3 2019-03-22 08:17:25
4 2019-03-22 11:12:20
5 2019-03-22 11:12:23
6 2019-03-24 12:33:23
At the end, i want to have this result :
A
0 2019-03-13 08:12:20
1 2019-03-13 08:12:21
2 2019-03-13 08:12:22
3 2019-03-13 08:12:23
4 2019-03-13 08:12:24
5 2019-03-13 08:12:25
6 2019-03-20 08:17:23
7 2019-03-22 08:17:25
8 2019-03-22 11:12:20
9 2019-03-22 11:12:21
10 2019-03-22 11:12:22
11 2019-03-22 11:12:23
12 2019-03-24 12:33:23
I tried with this :
import pandas as pd
df=pd.DataFrame({'A': ["2019-03-13 08:12:20", "2019-03-13 08:12:25", "2019-03-20 08:17:23", "2019-03-22 08:17:25", "2019-03-22 11:12:20", "2019-03-22 11:12:23", "2019-03-24 12:33:23"]})
df['A']=pd.to_datetime(df['A'])
fill = [pd.date_range(df.iloc[i]['A'], df.iloc[i+1]['A'], freq='S') for i in range(len(df)-1) if (df.iloc[i+1]['A']-df.iloc[i]['A']).total_seconds()<=7200]
dates = [item for sublist in fill for item in sublist]
df=df.set_index('A').join(pd.DataFrame(index=pd.Index(dates, name='A')), how='outer').reset_index()
print(df)
It's doing the job, but it's slow, is there any faster way to do this ?
Upvotes: 2
Views: 377
Reputation: 29635
You can create a column with a group number where the difference between two consecutive rows are below 2 hours, using diff
and cumsum
. Then set_index
the column A to be able to resample
per group and reset_index
to select the column you want.
df['gr'] = df.A.diff().gt(pd.Timedelta(hours=2)).cumsum()
df_output = df.set_index('A').groupby('gr', as_index=False).resample('s').sum().reset_index()[['A']]
print (df_output)
A
0 2019-03-13 08:12:20
1 2019-03-13 08:12:21
2 2019-03-13 08:12:22
3 2019-03-13 08:12:23
4 2019-03-13 08:12:24
5 2019-03-13 08:12:25
6 2019-03-20 08:17:23
7 2019-03-22 08:17:25
8 2019-03-22 11:12:20
9 2019-03-22 11:12:21
10 2019-03-22 11:12:22
11 2019-03-22 11:12:23
12 2019-03-24 12:33:23
Upvotes: 2