Ben Hernandez

Reputation: 867

Quickest way to return snapshots of an exponential moving average in PostgreSQL

I am collecting and graphing data, and one of the things I need to do is calculate an exponential moving average (EMA) over some of it. The data is stored in Postgres.
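
For reference, the recurrence I am computing is the standard EMA, seeded with the first value:

ema[t] = alpha * value[t] + (1 - alpha) * ema[t - 1]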

Based on another Stack Overflow question I read (How to calculate an exponential moving average on postgres?), I have the following function:

CREATE OR REPLACE FUNCTION ema_func(
state double precision,
inval double precision,
alpha double precision)
RETURNS double precision AS
$BODY$
begin
  return case
     when state is null then inval
     else alpha * inval + (1-alpha) * state
     end;
end
$BODY$
LANGUAGE plpgsql VOLATILE;

which I am then using as an aggregate like so to put it all together:

CREATE AGGREGATE ema(double precision, double precision) (
  SFUNC=ema_func,
  STYPE=float8
);
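
As a quick sanity check, here is how the aggregate behaves when used as a window function over a few made-up values (alpha = 0.5 here, not my real data):

SELECT t, ema(v, 0.5) OVER (ORDER BY t)
FROM (VALUES (1, 10.0), (2, 20.0), (3, 30.0)) AS s(t, v);
-- returns 10 (seeded with the first value),
-- then 0.5 * 20 + 0.5 * 10 = 15,
-- then 0.5 * 30 + 0.5 * 15 = 22.5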

I am graphing stock information, so for a given day I have about 7000-8000 data points. I don't need all of that to draw the graph (depending on my window settings, one pixel might represent around 60 seconds), so I want to pull a snapshot of my data every nth second. I wrote this function to do that, and it has saved me some time:

CREATE OR REPLACE FUNCTION emasnapshots(
    ptable varchar,
    timestart timestamptz,
    timeend timestamptz,
    duration double precision,
    psymbol varchar,
    alpha double precision)
    returns setof timevalue as
$BODY$
    DECLARE
        localstart timestamptz;
        localend timestamptz;
        timevalues timevalue%rowtype;
        groups int := ceil((SELECT EXTRACT(EPOCH FROM (timeend - timestart))) / duration);
    BEGIN
        -- compute the whole EMA series once into a temp table
        EXECUTE 'CREATE TEMP TABLE allemas ON COMMIT DROP AS select datetime, ema(value, ' || quote_literal(alpha) || ') over (order by datetime asc) from ' || quote_ident(ptable) || ' where symbol = ' || quote_literal(psymbol) || ' and datetime >= ' || quote_literal(timestart) || ' and datetime <= ' || quote_literal(timeend);
        FOR i in 1 .. groups LOOP
            -- return the most recent row within each duration-second window
            localstart := timestart + (duration * (i - 1) * interval '1 second');
            localend := timestart + (duration * i * interval '1 second');
            EXECUTE 'select * from allemas where datetime >= ' || quote_literal(localstart) || ' and datetime <= ' || quote_literal(localend) || ' order by datetime desc limit 1' into timevalues;
            return next timevalues;
        end loop;
    return;
    END
$BODY$
LANGUAGE plpgsql VOLATILE;

Running just my EMA with

select datetime::timestamptz, ema(value, 0.0952380952380952380952380952381 /* alpha */) over (order by datetime asc) from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 7:30' and datetime <= '2015-07-01 14:00:00'

takes about 1.5 seconds to collect all of the data (7733 rows) and push it across the internet (my data is in another state).

Running the emasnapshots function I wrote with

select start, average from emasnapshots ('4', '2015-07-01 9:30-4', '2015-07-01 16:00-4', 60, 'AAPL', 0.0952380952380952380952380952381);

takes about 0.5 seconds to gather all of the data and push it across the internet (390 rows). For clarity: I am pulling from table "4" for July 1st during stock market hours, and I want snapshots every 60 seconds. The last argument is my alpha, and it means I am calculating the 20-second EMA (alpha = 2 / (period + 1)).
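
(Checking the arithmetic: alpha = 2 / (20 + 1) = 0.0952380952..., which is the long decimal passed in the call above.)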

My question is: am I doing this the fastest way possible? Is there a way to tell which part of my function is slower, e.g. the temp table creation versus grabbing the snapshots? Should I be selecting the most recent row in each interval a different way? Should I instead select the latest time in each interval from my original table (which is indexed on time) and join that with my newly created table?

I just started writing Postgres functions about a week ago. I realize that my newly created temp table is not indexed, so the date-range lookups I am asking it to do might take a bit longer. Is there a way around this? I am dealing with many days' worth of data across many symbols, so I am not sure that precomputing EMA tables for every combination is a good idea. I also don't want to pull all of the raw data down and process it locally, because if the graphing software has multiple days open, that could easily be 35,000 rows to transfer and then process.

Btw, I don't think it is index speed or anything like that, because I can run:

select * from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 07:30' and datetime <= '2015-07-01 14:00' order by datetime asc limit 450

and get a response in under 150 ms over the internet. Obviously there is far less processing involved in that query, though.

Thank you so much for your time!

EDITED BASED ON PATRICK'S ANSWER.

I now have the query below, which I adapted from Patrick's answer:

SELECT datetime, ema FROM (
    SELECT datetime, ema, rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) = 1 AS rank
    FROM (
        SELECT datetime, ema(value, 0.0952380952380952380952380952381) OVER (ORDER BY datetime ASC) AS ema,
               ceil(extract(epoch from (datetime - '2015-07-01 7:30')) / 60) AS bucket
        FROM "4"
        WHERE symbol = 'AAPL'
          AND datetime BETWEEN '2015-07-01 7:30' AND '2015-07-01 14:00') x ) y
WHERE rank = true;

I was getting an error that the rank window function couldn't be referenced in a WHERE clause, so I moved the filter into an outer SELECT. Did I do this right? Having three SELECT statements feels odd to me, but I am an SQL newbie trying to learn as I go, so maybe that isn't so bad.

The EXPLAIN output for the above query looks like this:

Subquery Scan on y  (cost=6423.35..6687.34 rows=4062 width=16)
  Filter: y.rank
  ->  WindowAgg  (cost=6423.35..6606.11 rows=8123 width=24)
        ->  Sort  (cost=6423.35..6443.65 rows=8123 width=24)
              Sort Key: x.bucket, x.datetime
              ->  Subquery Scan on x  (cost=5591.23..5895.85 rows=8123 width=24)
                    ->  WindowAgg  (cost=5591.23..5814.62 rows=8123 width=16)
                          ->  Sort  (cost=5591.23..5611.54 rows=8123 width=16)
                                Sort Key: "4".datetime
                                ->  Bitmap Heap Scan on "4"  (cost=359.99..5063.74 rows=8123 width=16)
                                      Recheck Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
                                      ->  Bitmap Index Scan on "4_pkey"  (cost=0.00..357.96 rows=8123 width=0)
                                            Index Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))

Upvotes: 1

Views: 722

Answers (1)

Patrick

Reputation: 32316

First, a few notes about minor efficiency issues with your function:

  • You do not have to quote_literal() anything other than strings. It is impossible for Bobby Tables to be injected into your SQL statement through a double precision or timestamp parameter.
  • In a dynamic SQL statement you only have to splice in table and column names manually; parameter values can be passed in with the USING clause (see the sketch after the code below). This saves a lot of parsing time.
  • Move as many calculations as possible outside of your loop. For instance:
DECLARE
  ...
  dur_int interval := duration * interval '1 second';
  localStart timestamptz := timestart;
  localEnd timestamptz := localStart + dur_int;
BEGIN
  ...
  FOR i in 1 .. groups LOOP
    ...
    -- advance the window instead of recomputing it from timestart each time
    localStart := localStart + dur_int;
    localEnd := localEnd + dur_int;
  END LOOP;
  ...
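
And here is a minimal sketch of the first two points, rewriting the string-spliced EXECUTE inside your loop with parameters (same variables as in your function; the temp table needs no identifier splicing):

EXECUTE 'SELECT * FROM allemas WHERE datetime >= $1 AND datetime <= $2 '
        'ORDER BY datetime DESC LIMIT 1'
  INTO timevalues
  USING localStart, localEnd;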

But this is really all moot...

In your code you first populate a temp table with 7733 rows of data, from which you later extract one record at a time with a dynamic query, in a loop that runs 390 times. All very, very wasteful. You can replace the entire function body with a single statement:

RETURN QUERY EXECUTE format('SELECT datetime, ema '
   'FROM ('
     'SELECT datetime, ema, '
            'rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) AS rank '
     'FROM ('
       'SELECT datetime, ema(value, $1) OVER (ORDER BY datetime ASC) AS ema, '
              'ceil(extract(epoch from (datetime - $2)) / $3) AS bucket '
       'FROM %I ' 
       'WHERE symbol = $4 '
         'AND datetime BETWEEN $2 AND $5) x '
     'WHERE rank = 1) y '
   'ORDER BY 1', ptable) USING alpha, timestart, duration, psymbol, timeend;

The principle here is that in the innermost query you calculate the "bucket" into which each processed row from the table would fall. In the next level query you calculate the rank of all rows in each bucket based on the datetime. In the main query you then pick the most recent row from each bucket, i.e. the one where rank = 1.
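
For example, with timestart = '2015-07-01 9:30' and duration = 60, the bucket expression groups rows like this:

09:30:01  ->  ceil( 1 / 60.0) = bucket 1
09:30:59  ->  ceil(59 / 60.0) = bucket 1
09:31:00  ->  ceil(60 / 60.0) = bucket 1
09:31:01  ->  ceil(61 / 60.0) = bucket 2

Each bucket covers one 60-second window, and the rank = 1 row is the latest row inside it.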

On speed: you should really run EXPLAIN on all queries on the server, rather than measuring on the client, which includes network transfer time.
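
For instance, prefix the bucketed query itself with EXPLAIN (ANALYZE, BUFFERS) to execute it server-side and report the actual time spent in each plan node:

EXPLAIN (ANALYZE, BUFFERS)
SELECT datetime, ema FROM ( ... ) y ORDER BY 1;  -- the query built above

Note that explaining the function call (SELECT * FROM emasnapshots(...)) only shows an opaque Function Scan node; you have to explain the query the function builds.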

Upvotes: 1
