Reputation: 368
I am trying to resample a time series that contains duplicate timestamps. I want to resample it so that there is a time point every 0.1 seconds, inserting rows of NaN values at the newly created time points while keeping the existing rows as they are.
import pandas as pd
import numpy as np
d1 = ({
'Value' : ['A','A',np.nan,np.nan,'B','B','B'],
'Other' : ['X','X',np.nan,np.nan,'X','X',np.nan],
'Col' : [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.4','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.8'],
'Count' : [1,1,np.nan,5,6,7,8],
})
df1 = pd.DataFrame(data = d1)
df1['Time'] = pd.to_datetime(df1['Time'])
df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
.unstack()
.asfreq('0.1S', method ='pad')
.stack()
.reset_index(level=1, drop=True)
.sort_index()
.reset_index())
Output:
Time Value Other Col Count
0 2019-08-02 09:50:10.100 A X NaN 1.0
1 2019-08-02 09:50:10.200 A X NaN 1.0
2 2019-08-02 09:50:10.300 A X NaN 1.0
3 2019-08-02 09:50:10.700 NaN NaN NaN 5.0
4 2019-08-02 09:50:10.700 B X NaN 6.0
5 2019-08-02 09:50:10.700 B X NaN 7.0
6 2019-08-02 09:50:10.800 B NaN NaN 8.0
Intended Output:
Time Value Other Col Count
0 2019-08-02 09:50:10.100 A X NaN 1.0
1 2019-08-02 09:50:10.200 A X NaN 1.0
2 2019-08-02 09:50:10.300 NaN NaN NaN NaN
3 2019-08-02 09:50:10.400 NaN NaN NaN NaN
4 2019-08-02 09:50:10.500 NaN NaN NaN NaN
5 2019-08-02 09:50:10.600 NaN NaN NaN NaN
6 2019-08-02 09:50:10.700 NaN NaN NaN 5.0
7 2019-08-02 09:50:10.700 B X NaN 6.0
8 2019-08-02 09:50:10.700 B X NaN 7.0
9 2019-08-02 09:50:10.800 B NaN NaN 8.0
Upvotes: 2
Views: 868
Reputation: 22503
The problem is that stack() defaults to dropna=True. You can pass dropna=False and then filter out the unwanted duplicate rows with a boolean mask built from duplicated:
df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
.unstack()
.asfreq('0.1S', method ='pad')
.stack(dropna=False) #change dropna to False
.reset_index(level=1, drop=True)
.sort_index()
.reset_index())
print (df1[~df1.duplicated(["Value","Other","Col","Time","Count"], keep=False)|~df1['Time'].duplicated(keep='first')])
Time Value Other Col Count
0 2019-08-02 09:50:10.100 A X NaN 1.0
3 2019-08-02 09:50:10.200 A X NaN 1.0
6 2019-08-02 09:50:10.300 A X NaN 1.0
9 2019-08-02 09:50:10.400 NaN NaN NaN NaN
12 2019-08-02 09:50:10.500 NaN NaN NaN NaN
15 2019-08-02 09:50:10.600 NaN NaN NaN NaN
18 2019-08-02 09:50:10.700 NaN NaN NaN 5.0
19 2019-08-02 09:50:10.700 B X NaN 6.0
20 2019-08-02 09:50:10.700 B X NaN 7.0
21 2019-08-02 09:50:10.800 B NaN NaN 8.0
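To see why dropna=False matters here, this is a minimal sketch (the frame `wide` and its values are illustrative, not the question's data): after unstack/asfreq, the filler timestamps hold only NaNs, and the default stack() silently drops those all-NaN rows. Note that in recent pandas versions the dropna argument to stack is deprecated in favour of the new stack implementation.

```python
import pandas as pd
import numpy as np

# Illustrative wide frame, as produced by unstack(): two-level columns,
# one sub-column per duplicate slot, and a row that is entirely NaN.
wide = pd.DataFrame(
    {('Count', 0): [1.0, np.nan], ('Count', 1): [np.nan, np.nan]},
    index=pd.to_datetime(['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.3']),
)

dropped = wide.stack()             # default dropna=True: all-NaN rows vanish
kept = wide.stack(dropna=False)    # all-NaN rows survive as NaN rows

print(len(dropped), len(kept))     # the filler timestamp only appears in `kept`
```

With dropna=True only the single real value remains (1 row); with dropna=False every (timestamp, slot) pair is kept (4 rows), which is what lets the 09:50:10.3 filler row survive into the final output.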
Upvotes: 2
Reputation: 71580
Try using:
df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
.unstack()
.asfreq('100ms', method ='pad')
.stack()
.reset_index(level=1, drop=True)
.sort_index()
.reset_index())
dr = pd.date_range(df1['Time'].iloc[0], df1['Time'].iloc[-1], freq='100ms')
df2 = pd.DataFrame({'Time': dr[~dr.isin(df1['Time'])]}, columns = df1.columns)
print(pd.concat([df1,df2]).sort_values('Time').reset_index(drop=True))
Output:
Time Col Count Other Value
0 2019-08-02 09:50:10.100 NaN 1.0 X A
1 2019-08-02 09:50:10.200 NaN 1.0 X A
2 2019-08-02 09:50:10.300 NaN 1.0 X A
3 2019-08-02 09:50:10.400 NaN NaN NaN NaN
4 2019-08-02 09:50:10.500 NaN NaN NaN NaN
5 2019-08-02 09:50:10.600 NaN NaN NaN NaN
6 2019-08-02 09:50:10.700 NaN 5.0 NaN NaN
7 2019-08-02 09:50:10.700 NaN 6.0 X B
8 2019-08-02 09:50:10.700 NaN 7.0 X B
9 2019-08-02 09:50:10.800 NaN 8.0 NaN B
As you can see, I added the last three lines of code. I create a new data frame df2 containing the datetimes that are not already in df1, with all the other columns set to NaN. At the end, I concatenate the two data frames, sort by the datetime, and reset the index.
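The same fill-the-gaps step can also be written with DatetimeIndex.difference instead of a boolean isin mask. A minimal sketch on a trimmed version of the question's data (the names `grid`, `missing`, `filler`, and `out` are illustrative):

```python
import pandas as pd

# Trimmed stand-in for df1: duplicate timestamps, one value column.
df1 = pd.DataFrame({
    'Time': pd.to_datetime(['2019-08-02 09:50:10.1',
                            '2019-08-02 09:50:10.2',
                            '2019-08-02 09:50:10.7',
                            '2019-08-02 09:50:10.7']),
    'Count': [1, 1, 5, 6],
})

# Full 100ms grid from first to last timestamp, then keep only the
# timestamps that df1 does not already contain.
grid = pd.date_range(df1['Time'].min(), df1['Time'].max(), freq='100ms')
missing = grid.difference(df1['Time'])

# Rows of NaN for the missing timestamps, with df1's column layout.
filler = pd.DataFrame({'Time': missing}).reindex(columns=df1.columns)

out = (pd.concat([df1, filler])
         .sort_values('Time')
         .reset_index(drop=True))
print(out)
```

Here the grid spans 09:50:10.1 to 09:50:10.7 (7 points), four of which are missing from df1, so out has the original 4 rows plus 4 NaN filler rows. difference deduplicates and sorts for you, which makes the intent a little more explicit than the ~dr.isin(...) mask.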
Upvotes: 2