Reputation: 6786
The problem: I need to recalculate the mean and std dev over the last n minutes, every minute.
That is, if we assume n == 3, then I have 3 dataframes for, say, minutes 12:01, 12:02 and 12:03. At 12:04 I calculate the mean and std dev for those last 3 minutes.
At 12:05 I need to recalculate the mean and std dev of the dataframes for 12:02, 12:03 and 12:04.
Now, I can concat the last 3 dataframes each time a new minute passes and then calculate what I need. But that means I'm needlessly recalculating over every dataframe n-1 times.
Is there a way of "suspending" the computation on dataframes, or saving the intermediate results, adding a dataframe and then resuming it? (Captain Obvious plugin: for mathematical reasons I can't just average the last n-1 means and std dev values. Theoretically I could average the means if the number of samples in every df were equal, but it's not.)
(Obviously, I do not have the entire past dataset available at once: every minute 1 new df comes in, and the df older than n minutes is "deleted" from the calculation.)
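To make the parenthetical about unequal sample counts concrete, here is a tiny sketch with made-up numbers (the lists a and b are purely illustrative, not real data). It compares the plain average of two per-minute means with the mean of the pooled samples:

```python
# Hypothetical per-minute samples with different sizes.
a = [1.0, 2.0, 3.0]  # a minute with three samples
b = [10.0]           # a minute with one sample

# Plain average of the two per-minute means.
mean_of_means = (sum(a) / len(a) + sum(b) / len(b)) / 2

# Mean over all samples pooled together, which is what is actually wanted.
pooled_mean = (sum(a) + sum(b)) / (len(a) + len(b))

print(mean_of_means, pooled_mean)
```

The two values differ because the single sample in b gets the same weight as all three samples in a.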
Upvotes: 1
Views: 509
Reputation: 10139
You can do something like:
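import numpy

# Ring buffer of the last n per-minute means; assumes minute_df.mean() is a scalar.
rolling = numpy.zeros(n)
for i, minute_df in enumerate(new_df):
    rolling[i % n] = minute_df.mean()  # overwrite the oldest slot
    # Mean of means: exact only if every minute has the same number of samples.
    print(rolling.mean())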
Upvotes: 0
Reputation: 12192
You can calculate the mean (M), the second moment (M2) and the variance (D) for each dataframe, and when you need to aggregate some of them you can use the properties of these statistics:
m_i = len(X_i)
M(X_i) = sum(x for x in X_i) / m_i
M2(X_i) = sum(x ** 2 for x in X_i) / m_i
M(X1,X2,...Xn) = sum(M(X_i) * m_i) / sum(m_i)
M2(X1,X2,...Xn) = sum(M2(X_i) * m_i ) / sum(m_i)
D(X1, X2,...Xn) = M2(X1,X2,...Xn) - M(X1,X2,...Xn) ** 2
Then Std = sqrt(D),
where m_i is the number of observations in sample X_i.
For more information, see wiki.
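The formulas above can be turned into a rolling window along these lines. This is only a minimal sketch, not a definitive implementation: the class name RollingStats and its methods are made up for illustration, each minute's data is assumed to be a flat iterable of numbers (for example one dataframe column), and the result is the population std dev (dividing by the sample count), whereas pandas' std() defaults to ddof=1. Each minute is reduced once to (count, sum, sum of squares), which is equivalent to storing m_i, M and M2:

```python
from collections import deque
import math


class RollingStats:
    """Hypothetical helper: keeps per-minute summaries for the last n minutes."""

    def __init__(self, n):
        # deque(maxlen=n) drops the oldest summary automatically on append.
        self.window = deque(maxlen=n)

    def add(self, values):
        # Reduce one minute of data to (count, sum, sum of squares), once.
        values = [float(v) for v in values]
        self.window.append(
            (len(values), sum(values), sum(v * v for v in values))
        )

    def mean_std(self):
        # Combine the stored summaries; no raw data is revisited.
        count = sum(c for c, _, _ in self.window)
        if count == 0:
            raise ValueError("no samples in the window")
        total = sum(s for _, s, _ in self.window)
        total_sq = sum(q for _, _, q in self.window)
        mean = total / count
        # Population variance: M2 - M ** 2, clamped against rounding error.
        variance = max(total_sq / count - mean ** 2, 0.0)
        return mean, math.sqrt(variance)


# Usage with made-up minute batches and n == 3.
stats = RollingStats(3)
for minute_values in ([1, 2, 3], [10], [4, 4], [5, 5, 5]):
    stats.add(minute_values)
    print(stats.mean_std())
```

One caveat: subtracting M ** 2 from M2 can lose precision when the mean is large relative to the spread, so for such data a more numerically careful way of combining variances may be preferable.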
Upvotes: 1