James MacAdie

Reputation: 599

Override one TimeSeries with another

I'd like to override the values of one time series with another. The input series has values at every point. The override series has the same index (i.e. dates), but I only want to override the values at some dates. The way I've thought of specifying this is an override series that holds a value wherever I want that value applied and NaN wherever I don't want an override.

Perhaps best illustrated with a quick example:

            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11      2
2013-06-01     3   NaN      3
2013-07-01     4     9      4
2013-08-01     2    97      5

# should become

            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11     11
2013-06-01     3   NaN      3
2013-07-01     4     9      9
2013-08-01     2    97     97

As you can see from the example, I don't think the replace or where methods would work directly, as the replacement values depend on position in the index rather than on the input value. Because I want to do this more than once I've put it in a function, and I do have a solution that works, as shown below:

import pandas as pd

def overridets(ts, orts):
    # Align the two series on a common index, then take the override
    # value wherever it is not NaN, falling back to the original.
    tmp = pd.concat([ts, orts], join='outer', axis=1)
    out = tmp.apply(lambda x: x[0] if pd.isnull(x[1]) else x[1], axis=1)
    return out

The issue is that this runs relatively slowly: 20-30 ms for a 500-point series in my environment. Multiplying two 500-point series takes ~200 µs, so we're talking about roughly 100 times slower. Any suggestions on how to pick up the pace?

EDIT

Further to the help from @Andy and @bmu below my final solution to the problem is as follows:

def overridets(ts, orts):
    ts.name = 'outts'
    orts.name = 'orts'
    tmp = pd.concat([ts, orts], join='outer', axis=1)
    out = tmp['outts'].where(pd.isnull(tmp['orts']), tmp['orts'])
    return out

I didn't need inplace=True, as this was always wrapped in a function that returns a single time series. Almost 50 times faster, so thanks guys!
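The behaviour of this approach on the question's example data can be checked with a short standalone snippet (a sketch using where directly on two aligned series, rather than the concat-based function above; the dates and values are taken from the example):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    ["2013-04-01", "2013-05-01", "2013-06-01", "2013-07-01", "2013-08-01"]
)
ints = pd.Series([1, 2, 3, 4, 2], index=idx)
orts = pd.Series([np.nan, 11, np.nan, 9, 97], index=idx)

# Keep the input value where the override is NaN, otherwise take the override.
outts = ints.where(orts.isnull(), orts)
print(outts.tolist())  # [1.0, 11.0, 3.0, 9.0, 97.0]
```

Note the result is upcast to float because the override series contains NaN.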

Upvotes: 3

Views: 66

Answers (2)

Andrew Gross

Reputation: 46

The combine_first method is built into pandas and handles this case:

In [62]:  df

Out [62]:
                ints  orts  outts
    2013-04-01     1   NaN      1
    2013-05-01     2    11     11
    2013-06-01     3   NaN      3
    2013-07-01     4     9      9
    2013-08-01     2    97     97

In [63]:
    df['outts'] =  df.orts.combine_first(df.ints)
    df

Out [63]:
                ints  orts  outts
    2013-04-01     1   NaN      1
    2013-05-01     2    11     11
    2013-06-01     3   NaN      3
    2013-07-01     4     9      9
    2013-08-01     2    97     97

This should be as fast as any of the previous solutions...

In [64]:
    df500 = pd.DataFrame(np.random.randn(500, 2), columns=['orts', 'outts'])
    %timeit df500.orts.combine_first(df500.outts)

Out [64]:
    1000 loops, best of 3: 210 µs per loop
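The session above can be reproduced as a self-contained script (recreating the question's data; combine_first takes values from the calling series and fills its NaNs from the other):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    ["2013-04-01", "2013-05-01", "2013-06-01", "2013-07-01", "2013-08-01"]
)
df = pd.DataFrame(
    {"ints": [1, 2, 3, 4, 2], "orts": [np.nan, 11, np.nan, 9, 97]},
    index=idx,
)

# Take orts where it has a value, fall back to ints where it is NaN.
df["outts"] = df["orts"].combine_first(df["ints"])
print(df["outts"].tolist())  # [1.0, 11.0, 3.0, 9.0, 97.0]
```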

Upvotes: 0

Andy Hayden

Reputation: 375675

A faster way to copy a column's non-NaN values to another column is to use loc and a boolean mask:

In [11]: df1
Out[11]:
            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11      2
2013-06-01     3   NaN      3
2013-07-01     4     9      4
2013-08-01     2    97      5

In [12]: df1.loc[pd.notnull(df1['orts']), 'outts'] = df1['orts']

In [13]: df1
Out[13]:
            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11     11
2013-06-01     3   NaN      3
2013-07-01     4     9      9
2013-08-01     2    97     97

This is significantly faster than your function:

In [21]: df500 = pd.DataFrame(np.random.randn(500, 2), columns=['orts', 'outts'])

In [22]: %timeit overridets(df500['outts'], df500['orts'])
100 loops, best of 3: 14 ms per loop

In [23]: %timeit df500.loc[pd.notnull(df500['orts']), 'outts'] = df500['orts']
1000 loops, best of 3: 400 us per loop

In [24]: df100k = pd.DataFrame(np.random.randn(100000, 2), columns=['orts', 'outts'])

In [25]: %timeit overridets(df100k['outts'], df100k['orts'])
1 loops, best of 3: 2.67 s per loop

In [26]: %timeit df100k.loc[pd.notnull(df100k['orts']), 'outts'] = df100k['orts']
100 loops, best of 3: 9.61 ms per loop

As @bmu points out, you will in fact be better off using where:

In [31]: %timeit df500['outts'].where(pd.isnull(df500['orts']), df500['orts'], inplace=True)
1000 loops, best of 3: 281 us per loop

In [32]: %timeit df100k['outts'].where(pd.isnull(df100k['orts']), df100k['orts'], inplace=True)
100 loops, best of 3: 2.9 ms per loop
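A standalone sketch of the boolean-mask assignment on the question's data (names and values taken from the example above):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    ["2013-04-01", "2013-05-01", "2013-06-01", "2013-07-01", "2013-08-01"]
)
df1 = pd.DataFrame(
    {"ints": [1, 2, 3, 4, 2],
     "orts": [np.nan, 11, np.nan, 9, 97],
     "outts": [1, 2, 3, 4, 5]},
    index=idx,
)

# Overwrite outts only on the rows where orts has a (non-NaN) value.
mask = pd.notnull(df1["orts"])
df1.loc[mask, "outts"] = df1["orts"]
print(df1["outts"].tolist())
```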

Upvotes: 3
