merit_2

Reputation: 471

Resampling dataframe is producing unexpected results

Long story short: what is an appropriate resampling freq/rule? Sometimes I get a DataFrame mostly filled with NaNs, and sometimes it works great. I thought I had a handle on it.

Below is an example,

I am processing a lot of data and was changing my resample frequency when I noticed that, for certain resample rules, only one row has values and all the other rows contain NaNs.

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['date'] = pd.date_range(start='1/1/2018', end='5/08/2018')
df = df.set_index('date')  # resample needs a DatetimeIndex; the output below shows 'date' as the index

Creating some example data,

df['data1']=np.random.randint(1, 10, df.shape[0])
df['data2']=np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))

The data looks like,

print(df.head())
print(df.shape)

            data1  data2  data3
date                           
2018-01-01      7      7      0
2018-01-02      8      8      1
2018-01-03      2      7      2
2018-01-04      2      2      3
2018-01-05      2      5      4
(128, 3)

When I resample the data using offset aliases I get unexpected results.

Below I resample the data every 3 minutes.

resampled=df.resample('3T').mean()

print(resampled.head())
print(resampled.shape)

                     data1  data2  data3
date                                    
2018-01-01 00:00:00    4.0    5.0    0.0
2018-01-01 00:03:00    NaN    NaN    NaN
2018-01-01 00:06:00    NaN    NaN    NaN
2018-01-01 00:09:00    NaN    NaN    NaN
2018-01-01 00:12:00    NaN    NaN    NaN

Most of the rows are filled with NaN, apart from the first. I believe this is because there is no source data at the finer index positions implied by my resampling rule. Is this correct? '24H' is the smallest interval that works for this data; anything less leaves NaNs in the rows.
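For comparison, a rule no finer than the daily spacing of the data fills every bin. A minimal sketch using the same example frame (random values, so only the shape and NaN behaviour are meaningful):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame with a daily DatetimeIndex
df = pd.DataFrame(index=pd.date_range(start='1/1/2018', end='5/08/2018'))
df['data1'] = np.random.randint(1, 10, df.shape[0])
df['data2'] = np.random.randint(1, 10, df.shape[0])
df['data3'] = np.arange(len(df))

# Downsampling: every 7-day bin contains at least one daily row, so no NaNs
weekly = df.resample('7D').mean()
print(weekly.shape)                 # (19, 3) -- 128 days -> 19 weekly bins
print(weekly.isna().any().any())    # False
```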

Can a dataframe be resampled for increments less than the datetime resolution?

I have had trouble in the past trying to resample a large dataset spanning over a year, with the datetime index formatted as %Y:%j:%H:%M:%S (year:day-of-year:hour:minute:second; close enough without being verbose). Attempting to resample every 15 or 30 days produced very similar results, full of NaNs. I thought it was due to the odd date format having no month, but df.head() showed the index with correct dates.
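For what it's worth, that year:day-of-year format parses fine with pd.to_datetime, since %j is the standard day-of-year directive. A sketch with made-up timestamp strings:

```python
import pandas as pd

# Hypothetical raw timestamps in %Y:%j:%H:%M:%S form (day 032 of 2018 is Feb 1)
raw = ['2018:001:00:00:00', '2018:002:12:30:00', '2018:032:06:15:00']
idx = pd.to_datetime(raw, format='%Y:%j:%H:%M:%S')
print(idx)
# DatetimeIndex(['2018-01-01 00:00:00', '2018-01-02 12:30:00',
#                '2018-02-01 06:15:00'], dtype='datetime64[ns]', freq=None)
```

So the month is recovered correctly even though it never appears in the string; as long as the parsed index is a proper DatetimeIndex, the format itself should not cause resampling problems.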

Upvotes: 0

Views: 774

Answers (1)

Valdi_Bo

Reputation: 30971

When you resample by lowering the frequency (downsampling), one possible way to compute the result is mean(). It actually means:

  • The source DataFrame contains too detailed data.
  • You want to change the sampling frequency to some lower one and compute e.g. a mean of each column from some number of source rows for the current sampling period.

But when you increase the sampling frequency (upsample), then:

  • Your source data are too general.
  • You want to change the frequency to a higher one.
  • One of possible options to compute the result is e.g. to interpolate between known source values.

Note that when you upsample daily data to 3-minute frequency then:

  • The first row will contain data between 2018-01-01 00:00:00 and 2018-01-01 00:03:00.
  • The next row will contain data between 2018-01-01 00:03:00 and 2018-01-01 00:06:00.
  • And so on.

So, based on your source data:

  • The first row contains data from 2018-01-01 (sharp on midnight).
  • Since no source data is available for the time range between 00:03:00 and 00:06:00 (on 2018-01-01), the second row contains just NaN values.
  • The same pertains to further rows, up to 2018-01-01 23:57:00 (no source data for these time slices).
  • The next row, for 2018-01-02 00:00:00 can be filled with source data.
  • And so on.

There is nothing strange in this behaviour; resample just works this way. Since you are actually upsampling the source data, maybe you should interpolate the missing values?
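A minimal sketch of that suggestion, upsampling a small daily frame to 6-hour bins (6-hourly rather than 3-minute only to keep the output short) and filling the empty bins by linear interpolation:

```python
import numpy as np
import pandas as pd

# Three daily values: 0.0, 1.0, 2.0
df = pd.DataFrame({'data3': np.arange(3, dtype=float)},
                  index=pd.date_range(start='1/1/2018', periods=3))

# Upsampling leaves NaN bins; interpolate() fills them linearly
up = df.resample('6h').mean().interpolate()
print(up)
#                      data3
# 2018-01-01 00:00:00   0.00
# 2018-01-01 06:00:00   0.25
# 2018-01-01 12:00:00   0.50
# ...
# 2018-01-03 00:00:00   2.00
```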

Upvotes: 2
