Recalculating mean and std avg (Python, Pandas)

Question

The problem: every minute, I need to recalculate the mean and std dev over the last n minutes.

That is, if we assume n == 3, then I have 3 dataframes for, say, minutes 12:01, 12:02 and 12:03. At 12:04 I calculate the mean and std dev for those last 3 minutes.

At 12:05 I need to recalculate the mean and std dev of the dataframes for 12:02, 12:03 and 12:04.

Now, I can concat the last 3 dataframes each time a new minute passes and then calculate what I need. But that means I'm needlessly reprocessing every dataframe n-1 times.

Is there a way of "suspending" computation on dataframes or saving the intermediate results, adding a dataframe and then resuming it? (Captain Obvious note: for mathematical reasons I can't just average the last n-1 means and std dev values. In theory I could average the means if the number of samples in every df were equal, but it isn't.)

(Obviously, I do not have the entire past dataset available at once: every minute 1 new df comes in and the df older than n minutes is "deleted" from the calculation.)
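To make the point about unequal sample sizes concrete, here is a small illustration with made-up numbers (the lists `a` and `b` are hypothetical stand-ins for one column from two dataframes). It shows that a plain average of per-frame means differs from the mean of the pooled data, while weighting each mean by its sample count recovers it.

```python
from statistics import fmean

# Hypothetical per-minute samples with different sizes.
a = [1.0, 2.0, 3.0, 4.0]   # 4 samples, mean 2.5
b = [10.0]                 # 1 sample, mean 10.0

# Plain average of the two means ignores how many samples each had.
mean_of_means = fmean([fmean(a), fmean(b)])

# Mean of all samples pooled together.
true_mean = fmean(a + b)

# Weighting each mean by its sample count gives the pooled mean again.
weighted = (fmean(a) * len(a) + fmean(b) * len(b)) / (len(a) + len(b))

print(mean_of_means, true_mean, weighted)
```

Here the plain average comes out at 6.25, whereas the pooled and count-weighted means are both 4.0.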

Andrey Shokhin · Accepted Answer

You can calculate the number of observations (m), Mean (M) and Second Moment (M2) for each dataframe, and when you need to aggregate some of them you can use the properties of these statistics to get the variance (D) and Std:

m_i = len(X_i)

M(X_i) = sum(x for x in X_i) / m_i

M2(X_i) = sum(x ** 2 for x in X_i) / m_i

M(X1,X2,...Xn) = sum(M(X_i) * m_i) / sum(m_i)

M2(X1,X2,...Xn) = sum(M2(X_i) * m_i ) / sum(m_i)

D(X1, X2,...Xn) = M2(X1,X2,...Xn) - M(X1,X2,...Xn) ** 2

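Then Std = sqrt(D)

where m_i is the number of observations in sample X_i. Note that D here is the population variance (divisor sum(m_i)); pandas' .std() defaults to the sample version (ddof=1), so multiply D by sum(m_i) / (sum(m_i) - 1) before taking the square root if you want to match it.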

for more information see wiki
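As a rough sketch of how the formulas above could be applied to the sliding-window case in the question: keep one small summary per incoming dataframe and combine only the summaries. The class name `RollingStats` and its methods are made up for illustration, and instead of M and M2 it stores the raw sums (M * m and M2 * m), which is equivalent. `values` can be any iterable of numbers, for example a single dataframe column.

```python
from collections import deque
from math import sqrt


class RollingStats:
    """Mean/std over the last n batches, keeping only per-batch sums."""

    def __init__(self, n):
        # Each entry is (count, sum of x, sum of x**2) for one batch.
        # deque(maxlen=n) drops the oldest entry automatically.
        self.window = deque(maxlen=n)

    def add(self, values):
        """Summarise one batch, e.g. df["value"] (column name assumed)."""
        xs = [float(x) for x in values]
        self.window.append((len(xs), sum(xs), sum(x * x for x in xs)))

    def stats(self):
        """Return (mean, population std) over all batches in the window."""
        m = sum(c for c, _, _ in self.window)
        if m == 0:
            raise ValueError("no observations in the window")
        mean = sum(s for _, s, _ in self.window) / m
        mean_sq = sum(q for _, _, q in self.window) / m
        # Guard against a tiny negative value caused by rounding.
        variance = max(mean_sq - mean * mean, 0.0)
        return mean, sqrt(variance)


rs = RollingStats(n=3)
for batch in ([1.0, 2.0, 3.0], [4.0, 5.0], [6.0], [7.0, 8.0]):
    rs.add(batch)          # the first batch falls out when the fourth arrives
mean, std = rs.stats()     # computed over 4, 5, 6, 7, 8 only
print(mean, std)
```

Each dataframe is scanned once when it arrives; every later recalculation touches only n small tuples. One caveat: M2 - M ** 2 can lose precision when the mean is large relative to the spread, so for such data a more numerically stable merging scheme may be preferable.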
