Reputation: 6949
Example:
import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").sum()
2000-01-01 00:00:00 8.0
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64
In the above example, the sum for the 00:00-00:05 interval is computed as if the missing value were zero. What I want is for it to produce NaN for the 00:00 bin instead.
Or maybe I'd like it to be OK if there's one missing value in the interval, but to produce NaN if there are two missing values in the interval.
How can I do this?
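To make the goal concrete, for the first variant the desired result would look roughly like this (NaN for the 00:00 bin because it contains a missing value):
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64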
Upvotes: 11
Views: 8248
Reputation: 1
.resample().agg(pd.Series.sum, skipna=False) is much slower than .resample().sum(), particularly on DataFrames with many columns. These are obviously different methods that produce different results, but the core intent is the same, so I propose the following function, which is considerably faster.
test = pd.DataFrame(index = pd.date_range('2023-01-01 00:00', periods=16, freq='15T'), data={'A':10})
test.iloc[13] = pd.NA
test.iloc[:6] = pd.NA
test
date A
2023-01-01 00:00:00 NaN
2023-01-01 00:15:00 NaN
2023-01-01 00:30:00 NaN
2023-01-01 00:45:00 NaN
2023-01-01 01:00:00 NaN
2023-01-01 01:15:00 NaN
2023-01-01 01:30:00 10.0
2023-01-01 01:45:00 10.0
2023-01-01 02:00:00 10.0
2023-01-01 02:15:00 10.0
2023-01-01 02:30:00 10.0
2023-01-01 02:45:00 10.0
2023-01-01 03:00:00 10.0
2023-01-01 03:15:00 NaN
2023-01-01 03:30:00 10.0
2023-01-01 03:45:00 10.0
Testing with the above DataFrame copied to 100 columns shows that the "skipna=False" version is about 17x slower.
%%timeit
pd.concat([test]*100, axis=1).resample('H').sum()
2.07 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
pd.concat([test]*100, axis=1).resample('H').agg(pd.Series.sum, skipna=False)
34.1 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Beware that the function slightly inflates some sums (each filled NaN in a partially missing interval adds nan_number to that interval's total), but this can be controlled by choosing a very small nan_number; see the illustration at the end of this answer.
import pandas as pd

def resample_sum_keep_nans(df, target_freq='H', nan_number=0.01):
    """Return a downsampled dataframe in which a downsampled interval is NaN
    when all values in the corresponding source interval are NaN.

    The input df should have a time-series index that is monotonically
    increasing and has a defined frequency.

    Choose a 'nan_number' that is very small relative to the values in df, to
    avoid significantly altering the downsampled totals of source intervals
    that have some, but not all, values missing. Refer to the pandas resample
    method for how to define the 'target_freq' of the output df."""
    assert df.index.is_monotonic_increasing
    assert df.index.freq is not None
    # Confirm that the suggested nan_number is not already present in df.
    # This is important because the temporary fill-in nan_number must be
    # easily distinguished from the values already present in df.
    if df.sample(frac=0.5).dropna(how='all').eq(nan_number).any().any():
        print(f'nan_number {nan_number} exists in input df. '
              'Input a nan_number that is not present in df')
    else:
        # Determine what the temporary nan_number will sum to
        # across intervals in the downsampling.
        old_delta = df.iloc[:, 0].index.freq.delta
        new_delta = df.iloc[:, 0].resample(target_freq).sum().index.freq.delta
        freq_multiplier = new_delta / old_delta
        # Define the number to search for and replace with NaN in the resampled df.
        nan_number_resampled = freq_multiplier * nan_number
        # Fill NaNs with nan_number & resample.
        df = df.fillna(nan_number).resample(target_freq).sum()
        # Replace any values equal to nan_number_resampled with NaN and return df.
        return df.mask(df.eq(nan_number_resampled), pd.NA)
%%timeit
resample_sum_keep_nans(pd.concat([test]*100, axis=1))
3.94 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note that test.resample('H').agg(pd.Series.sum, skipna=False) discards the non-NaN values in source intervals that contain any NaN (e.g. the 2023-01-01 01:00:00 interval):
A
date
2023-01-01 00:00:00 NaN
2023-01-01 01:00:00 NaN
2023-01-01 02:00:00 40.0
2023-01-01 03:00:00 NaN
By contrast, resample_sum_keep_nans(test, nan_number=0.00001).round(0) keeps and sums the values that are mixed with NaNs in source intervals:
A
date
2023-01-01 00:00:00 NaN
2023-01-01 01:00:00 20.0
2023-01-01 02:00:00 40.0
2023-01-01 03:00:00 30.0
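To make the earlier caveat about slightly inflated sums concrete: with the default nan_number=0.01 and hourly bins of four 15-minute rows, the all-NaN 00:00 bin sums to 4 * 0.01 = 0.04 and is masked back to NaN, while partially missing bins are inflated by 0.01 per filled NaN, so the unrounded default call should give roughly:
resample_sum_keep_nans(test)
A
date
2023-01-01 00:00:00 NaN
2023-01-01 01:00:00 20.02
2023-01-01 02:00:00 40.00
2023-01-01 03:00:00 30.01
This is why the call above passes a much smaller nan_number and rounds the result.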
Upvotes: 0
Reputation: 164783
For one or more NaN values:
ts.resample('5min').agg(pd.Series.sum, skipna=False)
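On the question's series this should give NaN for the bin containing the missing value and the ordinary sums elsewhere:
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64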
For a minimum of 2 non-NaN values:
ts.resample('5min').agg(pd.Series.sum, min_count=2)
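Here min_count is the number of valid (non-NaN) values a bin needs in order to produce a non-NaN sum; recent pandas versions also accept it directly, as ts.resample('5min').sum(min_count=2). On the example series every bin has at least two valid values, so this matches the plain sum; a sketch with a stricter threshold makes the effect visible:
ts.resample('5min').agg(pd.Series.sum, min_count=5)
2000-01-01 00:00:00 NaN
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 NaN
Freq: 5T, dtype: float64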
For a maximum of 2 NaN values, it seems trickier:
ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)
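On the example series no bin has more than one missing value, so this simply reproduces the plain sums (8.0, 35.0, 21.0). To check how many NaNs each bin contains, something like:
ts.isnull().resample('5min').sum()
2000-01-01 00:00:00 1
2000-01-01 00:05:00 0
2000-01-01 00:10:00 0
Freq: 5T, dtype: int64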
You might expect ts.resample('5min').sum(skipna=False) to work in the same way as ts.sum(skipna=False), but the implementations are not consistent.
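For reference, skipna does behave as expected on a plain, non-resampled sum, where a single missing value makes the whole result NaN:
ts.sum(skipna=False)
nan
but the resampler's own sum() has no comparable skipna argument, hence the agg workaround above.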
Upvotes: 19