jonboy

Reputation: 368

Resample time series with duplicate values

I am trying to resample a time series that contains duplicate timestamps so that it includes a time point every 0.1 seconds. The newly created rows should contain NaN values, while the existing rows should be kept as they are.

import pandas as pd
import numpy as np

d1 = ({   
    'Value' : ['A','A',np.nan,np.nan,'B','B','B'],
    'Other' : ['X','X',np.nan,np.nan,'X','X',np.nan],  
    'Col' : [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],                          
    'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.4','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.8'],
    'Count' : [1,1,np.nan,5,6,7,8],
    })

df1 = pd.DataFrame(data = d1)

df1['Time'] = pd.to_datetime(df1['Time'])

df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
        .unstack()
        .asfreq('0.1S', method ='pad')
        .stack()
        .reset_index(level=1, drop=True)
        .sort_index()
        .reset_index())

Output:

                     Time Value Other  Col  Count
0 2019-08-02 09:50:10.100     A     X  NaN    1.0
1 2019-08-02 09:50:10.200     A     X  NaN    1.0
2 2019-08-02 09:50:10.300     A     X  NaN    1.0
3 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
4 2019-08-02 09:50:10.700     B     X  NaN    6.0
5 2019-08-02 09:50:10.700     B     X  NaN    7.0
6 2019-08-02 09:50:10.800     B   NaN  NaN    8.0

Intended Output:

                     Time Value Other  Col  Count
0 2019-08-02 09:50:10.100     A     X  NaN    1.0
1 2019-08-02 09:50:10.200     A     X  NaN    1.0
2 2019-08-02 09:50:10.300   NaN   NaN  NaN    NaN
3 2019-08-02 09:50:10.400   NaN   NaN  NaN    NaN
4 2019-08-02 09:50:10.500   NaN   NaN  NaN    NaN
5 2019-08-02 09:50:10.600   NaN   NaN  NaN    NaN
6 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
7 2019-08-02 09:50:10.700     B     X  NaN    6.0
8 2019-08-02 09:50:10.700     B     X  NaN    7.0
9 2019-08-02 09:50:10.800     B   NaN  NaN    8.0

Upvotes: 2

Views: 868

Answers (2)

Henry Yik

Reputation: 22503

The problem is that stack() defaults to dropna=True, which drops the all-NaN rows created by asfreq. Change it to dropna=False, then filter out the padded duplicates afterwards with a boolean mask built from duplicated:

df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
        .unstack()
        .asfreq('0.1S', method ='pad')
        .stack(dropna=False) #change dropna to False
        .reset_index(level=1, drop=True)
        .sort_index()
        .reset_index())

print (df1[~df1.duplicated(["Value","Other","Col","Time","Count"], keep=False)|~df1['Time'].duplicated(keep='first')])

                      Time Value Other  Col  Count
0  2019-08-02 09:50:10.100     A     X  NaN    1.0
3  2019-08-02 09:50:10.200     A     X  NaN    1.0
6  2019-08-02 09:50:10.300     A     X  NaN    1.0
9  2019-08-02 09:50:10.400   NaN   NaN  NaN    NaN
12 2019-08-02 09:50:10.500   NaN   NaN  NaN    NaN
15 2019-08-02 09:50:10.600   NaN   NaN  NaN    NaN
18 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
19 2019-08-02 09:50:10.700     B     X  NaN    6.0
20 2019-08-02 09:50:10.700     B     X  NaN    7.0
21 2019-08-02 09:50:10.800     B   NaN  NaN    8.0
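
For readability, the same filter can also be written with named intermediate masks. This is just a restatement of the one-liner above, assuming df1 is the resampled frame produced by the previous snippet:

# Rows that are exact duplicates of another row across all columns.
is_full_dup = df1.duplicated(["Value", "Other", "Col", "Time", "Count"], keep=False)

# The first row seen for each timestamp.
is_first_for_time = ~df1["Time"].duplicated(keep="first")

# Keep a row unless it is a padded duplicate that is not the first at its timestamp.
print(df1[~is_full_dup | is_first_for_time])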

Upvotes: 2

U13-Forward

Reputation: 71580

Try using:

df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
        .unstack()
        .asfreq('100ms', method ='pad')
        .stack()
        .reset_index(level=1, drop=True)
        .sort_index()
        .reset_index())
dr = pd.date_range(df1['Time'].iloc[0], df1['Time'].iloc[-1], freq='100ms')
df2 = pd.DataFrame({'Time': dr[~dr.isin(df1['Time'])]}, columns = df1.columns)
print(pd.concat([df1,df2]).sort_values('Time').reset_index(drop=True))

Output:

                     Time  Col  Count Other Value
0 2019-08-02 09:50:10.100  NaN    1.0     X     A
1 2019-08-02 09:50:10.200  NaN    1.0     X     A
2 2019-08-02 09:50:10.300  NaN    1.0     X     A
3 2019-08-02 09:50:10.400  NaN    NaN   NaN   NaN
4 2019-08-02 09:50:10.500  NaN    NaN   NaN   NaN
5 2019-08-02 09:50:10.600  NaN    NaN   NaN   NaN
6 2019-08-02 09:50:10.700  NaN    5.0   NaN   NaN
7 2019-08-02 09:50:10.700  NaN    6.0     X     B
8 2019-08-02 09:50:10.700  NaN    7.0     X     B
9 2019-08-02 09:50:10.800  NaN    8.0   NaN     B

As you can see, I added the last three lines of code. I create a new data frame df2 containing the timestamps from the full 100 ms range that are not already in df1, with every other column left as NaN. Finally, I concatenate the two data frames, sort by the datetime, and reset the index.
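
The gap-filling step can also be read on its own; the following is a minimal sketch of the same logic with illustrative variable names:

# All 100 ms timestamps spanned by the data.
full_range = pd.date_range(df1['Time'].iloc[0], df1['Time'].iloc[-1], freq='100ms')

# Timestamps in the full range that df1 does not already contain.
missing = full_range[~full_range.isin(df1['Time'])]

# Rows for the missing timestamps; every column other than Time stays NaN.
gap_rows = pd.DataFrame({'Time': missing}, columns=df1.columns)

result = pd.concat([df1, gap_rows]).sort_values('Time').reset_index(drop=True)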

Upvotes: 2
