user2171391
user2171391

Reputation:

Query long lists

I would like to query the value of an exponentially weighted moving average at particular points. An inefficient way to do this is as follows. l is the list of times of events and queries has the times at which I want the value of this average.

a=0.01
l = [3,7,10,20,200]
y = [0]*1000
for item in l:
        y[int(item)]=1
s = [0]*1000
for i in xrange(1,1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]

queries = [23,68,103]

for q in queries:
        print s[q]

Outputs:

0.0355271185019
0.0226018371526
0.0158992102478

In practice l will be very large and the range of values in l will also be huge. How can you find the values at the times in queries more efficiently, and especially without computing the potentially huge lists y and s explicitly. I need it to be in pure python so I can use pypy.

Is it possible to solve the problem in time proportional to len(l) and not max(l) (assuming len(queries) < len(l))?

Upvotes: 1

Views: 112

Answers (3)

Nicu Stiurca
Nicu Stiurca

Reputation: 8697

Here is my code for doing this:

def ewma(l, queries, a=0.01):
  def decay(t0, x, t1, a):
    from math import pow
    return pow((1-a), (t1-t0))*x

  assert l == sorted(l)
  assert queries == sorted(queries)

  samples = []
  try:
    t0, x0 = (0.0, 0.0)
    it = iter(queries)
    q = it.next()-1.0

    for t1 in l:
      # new value is decayed previous value, plus a
      x1 = decay(t0, x0, t1, a) + a
      # take care of all queries between t0 and t1
      while q < t1:
        samples.append(decay(t0, x0, q, a))
        q = it.next()-1.0
      # take care of all queries equal to t1
      while q == t1:
        samples.append(x1)
        q = it.next()-1.0
      # update t0, x0
      t0, x0 = t1, x1

    # take care of any remaining queries
    while True:
      samples.append(decay(t0, x0, q, a))
      q = it.next()-1.0
  except StopIteration:
    return samples

I've also uploaded a fuller version of this code with unit tests and some comments to pastebin: http://pastebin.com/shhaz710

EDIT: Note that this does the same thing as what Chris Pak suggests in his answer, which he must have posted as I was typing this. I haven't gone through the details of his code, but I think mine is a bit more general. This code supports non-integer values in l and queries. It also works for any kind of iterables, not just lists since I don't do any indexing.

Upvotes: 1

Chris Pak
Chris Pak

Reputation: 121

I think you could do it in ln(l) time, if l is sorted. The basic idea is that the non recursive form of EMA is a*s_i + (1-a)^1 * s_(i-1) + (1-a)^2 * s_(i-2) ....

This means for query k, you find the greatest number in l less than k, and for a estimation limit, use the following, where v is the index in l, l[v] is the value

(1-a)^(k-v) *l[v] + ....

Then, you spend lg(len(l)) time in search + a constant multiple for the depth of your estimation. I'll provide a code sample in a little bit (after work) if you want it, just wanted to get my idea out there while I was thinking about it

here's the code - v is the dictionary of values at a given time; replace with 1 if it's just a 1 every time...

import math
from bisect import bisect_right

a = .01
limit = 1000
l = [1,5,14,29...]

def find_nearest_lt(l, time):
    i = bisect_right(a, x)
    if i:
        return i-1
    raise ValueError

def find_ema(l, time):
    i = find_nearest_lt(l, time)
    if l[i] == time:
        result = a * v[l[i]
        i -= 1
    else:
        result = 0
    while (time-l[i]) < limit:
        result += math.pow(1-a, time-l[i]) * v[l[i]]
        i -= 1
    return result

if I'm thinking correctly, the find nearest is l(n), then the while loop is <= 1000 iterations, guaranteed, so it's technically a constant (though a kind of large one). find_nearest was stolen from the page on bisect - http://docs.python.org/2/library/bisect.html

Upvotes: 1

Brent Washburne
Brent Washburne

Reputation: 13158

It appears that y is a binary value -- either 0 or 1 -- depending on the values of l. Why not use y = set(int(item) for item in l)? That's the most efficient way to store and look up a list of numbers.

Your code will cause an error the first time through this loop:

s = [0]*1000
for i in xrange(1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]

because i-1 is -1 when i=0 (first pass of loop) and both y[-1] and s[-1] are the last element of the list, not the previous. Maybe you want xrange(1,1000)?

How about this code:

a=0.01
l = [3.0,7.0,10.0,20.0,200.0]
y = set(int(item) for item in l)
queries = [23,68,103]

ewma = []
x = 1 if (0 in y) else 0
for i in xrange(1, queries[-1]):
    x = (1-a)*x
    if i in y:
        x += a
    if i == queries[0]:
        ewma.append(x)
        queries.pop(0)

When it's done, ewma should have the moving averages for each query point.

Edited to include SchighSchagh's improvements.

Upvotes: 0

Related Questions