Reputation: 705
I have two tables and i have to query my postgresql database. The table 1 has about 140 million records and table 2 has around 50 million records of the following.
the table 1 has the following structure:
tr_id bigint NOT NULL, # this is the primary key
query_id numeric(20,0), # indexed column
descrip_id numeric(20,0) # indexed column
and table 2 has the following structure
query_pk bigint # this is the primary key
query_id numeric(20,0) # indexed column
query_token numeric(20,0)
The sample db of table1 would be
1 25 96
2 28 97
3 27 98
4 26 99
The sample db of table2 would be
1 25 9554
2 25 9456
3 25 9785
4 25 9514
5 26 7412
6 26 7433
7 27 545
8 27 5789
9 27 1566
10 28 122
11 28 1456
I am preferring queries in which i would be able to query in blocks of tr_id. In range of 10,000 as this is my requirement.
I would like to get output in the following manner
25 {9554,9456,9785,9514}
26 {7412,7433}
27 {545,5789,1566}
28 {122,1456}
I tried in the following manner
select query_id,
array_agg(query_token)
from sch.table2
where query_id in (select query_id
from sch.table1
where tr_id between 90001 and 100000)
group by query_id
I am performing the following query which takes about 121346 ms and when some 4 such queries are fired it still takes longer time. Can you please help me to optimise the same.
I have a machine which runs on windows 7 with i7 2nd gen proc with 8GB of RAM.
The following is my postgresql configuration
shared_buffers = 1GB
effective_cache_size = 5000MB
work_mem = 2000MB
What should I do to optimise it.
Thanks
EDIT : it would be great if the results ordered according to the following format
25 {9554,9456,9785,9514}
28 {122,1456}
27 {545,5789,1566}
26 {7412,7433}
ie according to the order of the queryid present in table1 ordered by tr_id. If this is computationally expensive may be in the client code i would try to optimise it. But I am not sure how efficient it would be.
Thanks
Upvotes: 2
Views: 771
Reputation: 656251
I expect a JOIN
to be much faster that the IN
condition you have presently:
SELECT t2.query_id
,array_agg(t2.query_token) AS tokens
FROM t1
JOIN t2 USING (query_id)
WHERE t1.tr_id BETWEEN 1 AND 10000
GROUP BY t1.tr_id, t2.query_id
ORDER BY t1.tr_id;
This also sorts the results as requested. query_token
remains unsorted per query_id
.
Obviously you need indexes on t1.tr_id
and t2.query_id
.
You obviously have that one already:
CREATE INDEX t2_query_id_idx ON t2 (query_id);
A multicolumn index on t1
may improve performance (you'll have to test):
CREATE INDEX t1_tr_id_query_id_idx ON t1 (tr_id, query_id);
If this is a dedicated database server, you can raise the setting for effective_cache_size
some more.
@Frank already gave advise on work_mem
. I quote the manual:
Note that for a complex query, several sort or hash operations might be running in parallel; each operation will be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem;
It should be just big enough to be able to sort your queries in RAM. 10 MB is more than plenty to hold 10000 of your rows at a time. Set it higher, if you have queries that need more at a time.
With 8 GB on a dedicated database server, I would be tempted to set shared_buffers
to at least 2 GB.
shared_buffers = 2GB
effective_cache_size = 7000MB
work_mem = 10MB
More advice on performance tuning in the Postgres Wiki.
Upvotes: 2