I am collecting and graphing data and one of the things I need to do is calculate an exponential moving average for some of my data. I have my data stored in postgres.
Based on another Stack Overflow question I read (How to calculate an exponential moving average on postgres?), I have the following function.
CREATE OR REPLACE FUNCTION ema_func(
    state double precision,
    inval double precision,
    alpha double precision)
RETURNS double precision AS
$BODY$
begin
    return case
        when state is null then inval
        else alpha * inval + (1 - alpha) * state
    end;
end
$BODY$
LANGUAGE plpgsql VOLATILE;
which I then use as an aggregate like so to put it all together:
CREATE AGGREGATE ema(double precision, double precision) (
SFUNC=ema_func,
STYPE=float8
);
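As a quick sanity check (made-up sample values, not my real table), the aggregate produces the expected running EMA when used as a window function:

-- With alpha = 0.5 the running EMA of 10, 20, 30 should be 10, 15, 22.5
SELECT t, ema(v::double precision, 0.5) OVER (ORDER BY t) AS ema
FROM (VALUES (1, 10.0), (2, 20.0), (3, 30.0)) AS sample(t, v);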
I am graphing stock information, so for a given day I have about 7000-8000 data points. I don't need all of that information to graph the data (depending on my window settings, 1 pixel might be worth somewhere around 60 seconds), so I want to pull snapshots of my data every nth second. I wrote this function to do that for me, and it has saved me some time.
CREATE OR REPLACE FUNCTION emasnapshots(
    ptable varchar,
    timestart timestamptz,
    timeend timestamptz,
    duration double precision,
    psymbol varchar,
    alpha double precision)
RETURNS setof timevalue AS
$BODY$
DECLARE
    localstart timestamptz;
    localend timestamptz;
    timevalues timevalue%rowtype;
    groups int := ceil((SELECT EXTRACT(EPOCH FROM (timeend - timestart))) / duration);
BEGIN
    EXECUTE 'CREATE TEMP TABLE allemas ON COMMIT DROP AS '
            'select datetime, ema(value, ' || quote_literal(alpha) || ') over (order by datetime asc) '
            'from ' || quote_ident(ptable) ||
            ' where symbol = ' || quote_literal(psymbol) ||
            ' and datetime >= ' || quote_literal(timestart) ||
            ' and datetime <= ' || quote_literal(timeend);
    FOR i in 1 .. groups LOOP
        localstart := timestart + (duration * (i - 1) * interval '1 second');
        localend := timestart + (duration * i * interval '1 second');
        EXECUTE 'select * from allemas where datetime >= ' || quote_literal(localstart) ||
                ' and datetime <= ' || quote_literal(localend) ||
                ' order by datetime desc limit 1' INTO timevalues;
        RETURN NEXT timevalues;
    END LOOP;
    RETURN;
END
$BODY$
LANGUAGE plpgsql VOLATILE;
Running just my EMA with
select datetime::timestamptz, ema(value, 0.0952380952380952380952380952381 /* alpha */) over (order by datetime asc) from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 7:30' and datetime <= '2015-07-01 14:00:00'
takes about 1.5 seconds to collect all of the data (7733 rows) and push it across the internet (my data is in another state).
Running the emasnapshots function I wrote with
select start, average from emasnapshots ('4', '2015-07-01 9:30-4', '2015-07-01 16:00-4', 60, 'AAPL', 0.0952380952380952380952380952381);
takes about 0.5 seconds to gather all of the data and push it across the internet (390 rows). For clarity: I am pulling from table "4" for July 1st during stock market hours, and I want snapshots every 60 seconds. The last number is my alpha, which means I am calculating the 20-second EMAs (alpha = 2/(period + 1)).
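To show where that constant comes from, the alpha for a 20-period EMA is just:

SELECT 2.0 / (20 + 1);  -- ≈ 0.0952380952..., the alpha used above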
My question is: am I doing this the fastest way possible? Is there a way to tell which part of my function is the slower part, for example whether it is the temp table creation or grabbing the snapshots? Should I be selecting the most recent date in an interval a different way? Should I be selecting the latest time in my interval from my original table (which is indexed on time) and joining that with my newly created table?
I just started writing Postgres functions about a week ago. I realize that my newly created table is not indexed, so it might take a bit longer to do the date-related work I am asking of it. Is there a way around this? I am dealing with many days' worth of data across many different symbols, so I am not sure that creating EMA tables for every possibility is a good idea. I don't want to pull all of the data down and do the processing locally, because if the graphing software has multiple days open, that could easily mean 35,000 rows that would have to be transferred and then processed.
Btw, I don't think it is indexing speed or anything like that because I can run:
select * from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 07:30' and datetime <= '2015-07-01 14:00' order by datetime asc limit 450
and get a response in under 150ms over the internet. Obviously there is way less processing in this though.
Thank you so much for your time!
I now have the query below which I modified from what Patrick said:
SELECT datetime, ema FROM (
SELECT datetime, ema, rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) = 1 as rank
FROM (
SELECT datetime, ema(value, 0.0952380952380952380952380952381) OVER (ORDER BY datetime ASC) AS ema,
ceil(extract(epoch from (datetime - '2015-07-01 7:30')) / 60) AS bucket
FROM "4"
WHERE symbol = 'AAPL'
AND datetime BETWEEN '2015-07-01 7:30' AND '2015-07-01 14:00' ) x ) y
WHERE rank = true;
I was getting an error saying I couldn't use rank() in a WHERE clause, so I split it out into another SELECT statement. Did I do this right? Having three SELECT statements feels odd to me, but I am an SQL newbie trying to learn as I go, so maybe that isn't so bad.
The EXPLAIN output for the above query looks like this.
Subquery Scan on y (cost=6423.35..6687.34 rows=4062 width=16)
Filter: y.rank
-> WindowAgg (cost=6423.35..6606.11 rows=8123 width=24)
-> Sort (cost=6423.35..6443.65 rows=8123 width=24)
Sort Key: x.bucket, x.datetime
-> Subquery Scan on x (cost=5591.23..5895.85 rows=8123 width=24)
-> WindowAgg (cost=5591.23..5814.62 rows=8123 width=16)
-> Sort (cost=5591.23..5611.54 rows=8123 width=16)
Sort Key: "4".datetime
-> Bitmap Heap Scan on "4" (cost=359.99..5063.74 rows=8123 width=16)
Recheck Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
-> Bitmap Index Scan on "4_pkey" (cost=0.00..357.96 rows=8123 width=0)
Index Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
First, a few notes about minor efficiency issues with your function:
- Don't use quote_literal() on anything other than strings. It is impossible that Bobby Tables is injected into your SQL statement through a double precision or timestamp parameter. Pass those values to the dynamic query with a USING clause instead (see the sketch after this list). This saves a lot of parsing time.
- Instead of recomputing localstart and localend from timestart on every pass through the loop, compute the step interval once and just add it each iteration:

DECLARE
    ...
    dur_int interval := duration * interval '1 second';
    localStart timestamptz := timestart;
    localEnd timestamptz := localStart + dur_int;
BEGIN
    ...
    FOR i in 1 .. groups LOOP
        ...
        localStart := localStart + dur_int;
        localEnd := localEnd + dur_int;
    END LOOP;
    ...
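For instance, the per-snapshot query inside the loop could pass its bounds via USING instead of quote_literal() (a sketch of the idea, not code from the original post):

EXECUTE 'select * from allemas '
        'where datetime >= $1 and datetime <= $2 '
        'order by datetime desc limit 1'
    INTO timevalues
    USING localstart, localend;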
But this is really all moot...
In your code you first populate a temp table with 7733 rows of data, from which you later extract one record at a time with a dynamic query, in a loop that runs 390 times. All very, very wasteful. You can replace the entire function body with just a single statement:
RETURN QUERY EXECUTE format(
    'SELECT datetime, ema '
    'FROM ('
        'SELECT datetime, ema, '
               'rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) AS rank '
        'FROM ('
            'SELECT datetime, ema(value, $1) OVER (ORDER BY datetime ASC) AS ema, '
                   'ceil(extract(epoch from (datetime - $2)) / $3) AS bucket '
            'FROM %I '
            'WHERE symbol = $4 '
            'AND datetime BETWEEN $2 AND $5) x) y '
    'WHERE rank = 1 '
    'ORDER BY 1', ptable) USING alpha, timestart, duration, psymbol, timeend;
The principle here is that in the innermost query you calculate the "bucket" into which each processed row from the table would fall. In the next level query you calculate the rank of all rows in each bucket based on the datetime. In the main query you then pick the most recent row from each bucket, i.e. the one where rank = 1.
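To make the bucketing concrete (numbers chosen just for illustration): with duration = 60, a row 45 seconds after timestart falls in bucket 1 and a row 70 seconds after falls in bucket 2:

SELECT ceil(45 / 60.0) AS bucket_a, ceil(70 / 60.0) AS bucket_b;  -- 1, 2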
On speed: you should really do an EXPLAIN on all queries on the server, rather than measure on the client, which includes network transfer time.
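For example, run directly on the server in psql (note that EXPLAIN ANALYZE actually executes the query):

EXPLAIN (ANALYZE, BUFFERS)
SELECT datetime, ema(value, 0.0952380952380952) OVER (ORDER BY datetime ASC)
FROM "4"
WHERE symbol = 'AAPL'
  AND datetime BETWEEN '2015-07-01 7:30' AND '2015-07-01 14:00';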