Antonis Christofides
Antonis Christofides

Reputation: 6949

While resampling, put NaN in the resulting value if there are some NaN values in the source interval

Example:

import pandas as pd
import numpy as np

rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00     0.0
2000-01-01 00:01:00     1.0
2000-01-01 00:02:00     NaN
2000-01-01 00:03:00     3.0
2000-01-01 00:04:00     4.0
2000-01-01 00:05:00     5.0
2000-01-01 00:06:00     6.0
2000-01-01 00:07:00     7.0
2000-01-01 00:08:00     8.0
2000-01-01 00:09:00     9.0
2000-01-01 00:10:00    10.0
2000-01-01 00:11:00    11.0
Freq: T, dtype: float64
ts.resample("5min").sum()
2000-01-01 00:00:00     5.0
2000-01-01 00:05:00    30.0
2000-01-01 00:10:00    30.0
Freq: 5T, dtype: float64

In the above example, it extracts the sum of the interval 00:00-00:05 as if the missing value was zero. What I want is for it to produce result NaN in 00:00.

Or, maybe I'd like for it to be OK if there's one missing value in the interval, but NaN if there are two missing values in the interval.

How can I do these?

Upvotes: 11

Views: 8248

Answers (2)

emskiphoto
emskiphoto

Reputation: 1

The performance of .resample().agg(pd.Series.sum, skipna=False) is much slower than .resample().sum(), particularly on dataframes with many columns. Obviously these are different methods that produce different results but the core intent is the same, so I would propose the following function that has improved speed.

test = pd.DataFrame(index = pd.date_range('2023-01-01 00:00', periods=16, freq='15T'), data={'A':10})
test.iloc[13] = pd.NA
test.iloc[:6] = pd.NA
test

               date    A
2023-01-01 00:00:00  NaN
2023-01-01 00:15:00  NaN
2023-01-01 00:30:00  NaN
2023-01-01 00:45:00  NaN
2023-01-01 01:00:00  NaN
2023-01-01 01:15:00  NaN
2023-01-01 01:30:00 10.0
2023-01-01 01:45:00 10.0
2023-01-01 02:00:00 10.0
2023-01-01 02:15:00 10.0
2023-01-01 02:30:00 10.0
2023-01-01 02:45:00 10.0
2023-01-01 03:00:00 10.0
2023-01-01 03:15:00  NaN
2023-01-01 03:30:00 10.0
2023-01-01 03:45:00 10.0

When testing the above dataframe copied to 100 columns it can be seen that the "skipna=False" version is 17X slower.

%%timeit pd.concat([test]*100, axis=1).resample('H').sum() 2.07 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit pd.concat([test]*100, axis=1).resample('H').agg(pd.Series.sum, skipna=False) 34.1 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Beware that the function slightly increases some sums, but this can be controlled by minimizing the nan_number input.

import pandas as pd

def resample_sum_keep_nans(df, target_freq='H', nan_number = 0.01):
    """Function returns a downsampled dataframe that returns NaN for downsampled
    intervals when all values in source intervals are NaN.
    Input df should have a time-series index that 'is_monotonic_increasing'
    and has a defined frequency.
    Select a 'nan_number' value that is very small relative to the input 
    df values to avoid significant alteration of the output downsampled totals
    for source intervals that have some but not all NaN values.  Refer to Pandas
    resample method for orientation on defining 'target_freq' of output df"""
    assert df.index.is_monotonic_increasing
    assert df.index.freq is not None
#     confirm that the suggested nan_number is not present in the df already
# this is important because the temporary fill-in nan_number must be easily distinguished
# from the values already present in df.
    if df.sample(frac=0.5).dropna(how='all').eq(nan_number).any().any():
        print(f'nan_number {nan_number} exists in input df.  '
              'Input a nan_number that is not present in df')
    else:
#         determine what the temporary nan_number will be after being summed 
        # across intervals in downsampling
        old_delta = df.iloc[:,0].index.freq.delta
        new_delta = df.iloc[:,0].resample(target_freq).sum().index.freq.delta
        freq_multiplier = new_delta / old_delta
# define the number to search for and replace with NaN in the resampled df
        nan_number_resampled = freq_multiplier * nan_number
#         print(nan_number_resampled)
# fill NaNs with nan_number & resample
        df = df.fillna(nan_number).resample(target_freq)\
            .sum()
# fill any values equal to nan_number_resampled with NaN and return df
        return df.mask(df.eq(nan_number_resampled), pd.NA)

%%timeit resample_sum_keep_nans(pd.concat([test]*100, axis=1)) 3.94 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Note that test.resample('H').agg(pd.Series.sum, skipna=False) discards values from source intervals that contain NaN (2023-01-01 01:00:00).

                        A
date                     
2023-01-01 00:00:00   NaN
2023-01-01 01:00:00   NaN
2023-01-01 02:00:00  40.0
2023-01-01 03:00:00   NaN

While the resample_sum_keep_nans(test, nan_number=0.00001).round(0) recognizes and sums values that are mixed with NaNs in source intervals.

                        A
date                     
2023-01-01 00:00:00   NaN
2023-01-01 01:00:00  20.0
2023-01-01 02:00:00  40.0
2023-01-01 03:00:00  30.0

Upvotes: 0

jpp
jpp

Reputation: 164783

For one or more NaN values:

ts.resample('5min').agg(pd.Series.sum, skipna=False)

For a minimum of 2 non-NaN values:

ts.resample('5min').agg(pd.Series.sum, min_count=2)

For a maximum of 2 NaN values seems tricker:

ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)

You might expect ts.resample('5min').sum(skipna=False) to work in the same way as ts.sum(skipna=False), but the implementations are not consistent.

Upvotes: 19

Related Questions