joaoavf

Reputation: 1383

Query Optimization

That's my current query, it works but it is slow:

SELECT row, min(flg) || ' to ' || max(flg) AS xyz, avg(amt_won), count(*)
FROM (
    -- tag each row with its player's avg(flg_vpip) and a ~100-bucket number
    SELECT (row_number() OVER (ORDER BY flg)) * 100 /
           (SELECT count(*) + 100 AS temprow FROM temporary_six_max) AS row,
           flg, amt_won
    FROM temporary_six_max
    JOIN (
        -- per-player average of flg_vpip
        SELECT id_player AS pid, avg(flg_vpip::int) AS flg
        FROM temporary_six_max
        GROUP BY id_player
    ) AS auxtable ON pid = id_player
) AS auxtable2
GROUP BY 1
ORDER BY 1;

I am grouping the rows into roughly 100 buckets of (almost) fixed size, ordered by avg(flg_vpip) computed per id_player.

Here I've pasted the results in case it may help to understand: https://spreadsheets0.google.com/ccc?key=tFVsxkWVn4fMWYBxxGYokwQ&authkey=CNDvuOcG&authkey=CNDvuOcG#gid=0

I wonder if there is a better function to use than ROW_NUMBER() in this case, and I feel like I am doing too many subselects, but I don't know how to optimize it.

I'd very much appreciate any help.

If something is not clear just let me know.

Thank you.

EDIT:

The reason I created auxtable2 is that when I use ROW_NUMBER() OVER (ORDER BY flg) together with other aggregate functions such as avg(amt_won) and count(*), which are essential, I get an error saying that flg should be in an aggregate function, but I can't order by an aggregate of flg.
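A minimal sketch of what I mean, with a made-up table t(grp, flg, amt_won): window functions are evaluated after GROUP BY, so an ungrouped column inside OVER (ORDER BY ...) triggers that error, and the only fix I found was the extra subselect:

-- Fails: flg is neither grouped nor aggregated, so PostgreSQL complains that
-- it must appear in the GROUP BY clause or be used in an aggregate function.
SELECT row_number() OVER (ORDER BY flg) AS rn, avg(amt_won), count(*)
FROM t
GROUP BY grp;

-- Works: compute the window function in a subselect, aggregate one level up.
SELECT rn, avg(amt_won), count(*)
FROM (SELECT amt_won, row_number() OVER (ORDER BY flg) AS rn FROM t) sub
GROUP BY rn;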

Upvotes: 0

Views: 288

Answers (1)

araqnid

Reputation: 133802

I generated some data to test with like this:

-- 1,000,000 rows spread over roughly 1,000 players; flg_vpip is a random 0/1
-- flag, and amt_won is a random amount whose sign follows flg_vpip.
create table temporary_six_max as
select id_player, flg_vpip,
       random()*100 * (case flg_vpip when 0 then 1 else -1 end) as amt_won
from (select (random()*1000)::int as id_player, random()::int as flg_vpip
      from generate_series(1,1000000)) source;
create index on temporary_six_max(id_player);

Your query runs successfully against that, but doesn't quite generate the same plan: I get a nested loop in the lower arm rather than a merge, and a seq scan in the init-plan. You haven't turned off enable_seqscan, I hope?
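(If in doubt, the planner toggle can be checked and restored per session, e.g.:)

SHOW enable_seqscan;      -- normally "on"
SET enable_seqscan = on;  -- re-enable for this session if it was off
RESET enable_seqscan;     -- or fall back to the configured default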

A solution just using a single scan of the table:

select row, min(flg) || ' to ' || max(flg) as xyz, avg(amt_won), count(*)
from (select flg, amt_won, ntile(100) over(order by flg) as row
      from (select id_player as pid, amt_won,
                   avg(flg_vpip::int) over (partition by id_player) as flg
            from temporary_six_max
           ) player_stats
     ) chunks
group by 1
order by 1

The bad news is that this actually performs worse on my machine, especially if I bump work_mem up enough to avoid the first disk sort (building player_stats, sorted by flg). That said, increasing work_mem did halve the query time, so I guess that is at least a start?
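(For reference, work_mem can be raised just for the session running the query; the value below is only an illustration, size it to the sort you want kept in memory:)

SET work_mem = '256MB';   -- example value only; applies per sort/hash operation
-- run the query here, then
RESET work_mem;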

Having said that, my queries run for about 5 seconds to process the 10^6 input rows in temporary_six_max, which is an order of magnitude faster than what you posted. Does your table fit into your buffer cache? If not, a single-scan solution may be much better for you. (Which version of PostgreSQL are you using? "explain (analyze on, buffers on) select..." will show you buffer hit/miss rates in 9.0; otherwise, just look at your "shared_buffers" setting and compare it with the table size.)
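Something along these lines (a sketch; pg_total_relation_size/pg_size_pretty are standard size functions, and your own query goes where the ellipsis is):

-- Per-node buffer hit/miss statistics (PostgreSQL 9.0+):
EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT ...;  -- your query here

-- Compare the table size (including indexes) with the cache setting:
SELECT pg_size_pretty(pg_total_relation_size('temporary_six_max'));
SHOW shared_buffers;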

Upvotes: 2
