Reputation: 2092
I'm performing full-text search in PostgreSQL with queries like this (the aim is to return a list of geocoded cities matching a searched city name):
SELECT *
FROM cities
WHERE city_tsvector @@ to_tsquery('Paris')
AND postcode LIKE '75%'
This query runs quite fast on my database (425,626 rows in the cities table): around 100 ms. So far, so good.
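For reference, this @@ search is presumably only that fast because city_tsvector is backed by a GIN index; a minimal sketch of such an index, in case it is missing (the index name here is made up):

-- hypothetical index name; a GIN index is what makes @@ matches fast
create index ix_cities_city_tsvector on cities using gin (city_tsvector);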
Now I have to perform this search on 400 cities at the same time.
400 x 100ms = 40 seconds, which is way too long for my users.
I'm trying to write a single query to perform this search in one go. One detail: the cities I must search for are not stored in the database.
So I wrote this kind of query:
SELECT DISTINCT ON (myid) *
FROM unnest(
array[19977,19978,19979, (and so on)]::int[],
array['SAULXURES','ARGENTEUIL','OBERHOFFEN&SUR&MODER', (and so on)]::text[],
array['67','95','67','44', (and so on)]::text[]
) AS t(myid, cityname, mypostcode)
LEFT JOIN cities gc2 ON gc2.city_tsvector @@ to_tsquery(cityname) AND gc2.postcode LIKE CONCAT(mypostcode,'%')
ORDER BY myid
;
The result is catastrophic: for the same searches, the query is four times slower!
Is it possible to perform this kind of query so that the execution takes less time?
Thanks
EDIT
Here is the structure of the cities table (425,626 rows):
EDIT, following @The-Impaler's answer:
Query takes 11 seconds.
EXPLAIN VERBOSE:
Unique  (cost=71133.21..71138.53 rows=100 width=40)
  Output: t.myid, (st_astext(gc2.gps_coordinates))
  ->  Sort  (cost=71133.21..71135.87 rows=1064 width=40)
        Output: t.myid, (st_astext(gc2.gps_coordinates))
        Sort Key: t.myid
        ->  Hash Right Join  (cost=2.26..71079.72 rows=1064 width=40)
              Output: t.myid, st_astext(gc2.gps_coordinates)
              Hash Cond: (left((gc2.postcode)::text, 2) = t.mypostcode)
              Join Filter: (gc2.city_tsvector @@ to_tsquery(t.cityname))
              ->  Seq Scan on public.geo_cities gc2  (cost=0.00..13083.26 rows=425626 width=69)
                    Output: gc2.id, gc2.country_code, gc2.city, gc2.postcode, gc2.gps_coordinates, gc2.administrative_level_1_name, gc2.administrative_level_1_code, gc2.administrative_level_2_name, gc2.administrative_level_2_code, gc2.administrative_level_ (...)
              ->  Hash  (cost=1.01..1.01 rows=100 width=72)
                    Output: t.myid, t.cityname, t.mypostcode
                    ->  Function Scan on t  (cost=0.01..1.01 rows=100 width=72)
                          Output: t.myid, t.cityname, t.mypostcode
                          Function Call: unnest('{289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,28914887922 (...)
Query takes 12 seconds.
EXPLAIN VERBOSE:
Unique  (cost=71665.25..71670.57 rows=100 width=40)
  Output: t.myid, (st_astext(gc2.gps_coordinates))
  ->  Sort  (cost=71665.25..71667.91 rows=1064 width=40)
        Output: t.myid, (st_astext(gc2.gps_coordinates))
        Sort Key: t.myid
        ->  Hash Right Join  (cost=2.26..71611.75 rows=1064 width=40)
              Output: t.myid, st_astext(gc2.gps_coordinates)
              Hash Cond: ((substring((gc2.postcode)::text, 1, 2))::text = t.mypostcode)
              Join Filter: (gc2.city_tsvector @@ to_tsquery(t.cityname))
              ->  Seq Scan on public.geo_cities gc2  (cost=0.00..13083.26 rows=425626 width=69)
                    Output: gc2.id, gc2.country_code, gc2.city, gc2.postcode, gc2.gps_coordinates, gc2.administrative_level_1_name, gc2.administrative_level_1_code, gc2.administrative_level_2_name, gc2.administrative_level_2_code, gc2.administrative_level_ (...)
              ->  Hash  (cost=1.01..1.01 rows=100 width=72)
                    Output: t.myid, t.cityname, t.mypostcode
                    ->  Function Scan on t  (cost=0.01..1.01 rows=100 width=72)
                          Output: t.myid, t.cityname, t.mypostcode
                          Function Call: unnest('{289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,28914887922 (...)
Upvotes: 0
Views: 158
Reputation: 48770
PART ONE - Enforcing Index Range Scan on postcode
The key optimization is to make each search work on about 4,000 rows instead of 400k rows. This should be easy if the filtering by the first two digits of the postcode is done appropriately.
Option #1: Create an index on the first two characters of postcode:
create index ix1 on cities (substring(postcode from 1 for 2));
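Note: the planner will only use this expression index when the query repeats the exact same expression, so the LIKE '75%' filter has to be rewritten accordingly. A sketch against the original single-city query (my rewrite, not tested on your data):

-- equality on the indexed expression instead of LIKE '75%'
select *
from cities
where substring(postcode from 1 for 2) = '75'
and city_tsvector @@ to_tsquery('Paris');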
Option #2: Use = instead of LIKE when filtering by postcode. You'll need to create a pseudo-column minipostcode and create an index on it:
create or replace function minipostcode(cities)
returns char(2) as $$
  -- explicit cast so the result matches the declared char(2) return type
  select substring($1.postcode from 1 for 2)::char(2)
$$ immutable language sql; -- must be IMMUTABLE to be usable in an index expression

create index ix_cities_minipostcode on cities (minipostcode(cities));
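Before rewriting the big query, it's worth sanity-checking that the planner actually picks the new index. Attribute notation like cities.minipostcode expands to the call minipostcode(cities), so a quick check could look like this (you would hope to see an index scan on ix_cities_minipostcode rather than a seq scan):

-- quick check that the new index is actually used
explain
select * from cities
where cities.minipostcode = '75';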
And your SQL will change to:
SELECT DISTINCT ON (myid) *
FROM unnest(
array[19977,19978,19979, (and so on)]::int[],
array['SAULXURES','ARGENTEUIL','OBERHOFFEN&SUR&MODER', (and so on)]::text[],
array['67','95','67','44', (and so on)]::text[]
) AS t(myid, cityname, mypostcode)
LEFT JOIN cities gc2 ON gc2.city_tsvector @@ to_tsquery(cityname)
AND gc2.minipostcode = mypostcode
ORDER BY myid
;
See the gc2.minipostcode = there?
Check the execution plan
Get the execution plan for each one of the options above and compare using:
explain verbose <my-query>
Please post the execution plan of both options.
PART TWO - Reducing the number of scans.
Once you are sure it's using an Index Range Scan on postcode, you can further optimize it.
Since your search has 400 values and there are only about 100 possible two-digit prefixes, each postcode prefix is probably repeated around 4 times. Do a single Index Range Scan per prefix instead of one per city: that's 100 scans instead of 400, cutting the scan work by roughly 75%.
However, you can't do this using pure SQL: you'll need to pre-process your search list, generating a single query per postcode. You won't use unnest anymore; instead you'll prebuild the SQL in your application language.
For example, since 67 shows up twice in your sample data, its two cities should be combined into a single Index Range Scan, resulting in something like:
select * from cities gc2
where (gc2.city_tsvector @@ to_tsquery('SAULXURES')
or gc2.city_tsvector @@ to_tsquery('OBERHOFFEN&SUR&MODER'))
and gc2.minipostcode = '67'
union
(next select for another postcode here, and so on...)
This is much more optimized than your unnest option, since it executes at most 100 Index Range Scans even if you increase your search to 1000 or 5000 criteria.
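As a concrete sketch, here is what the prebuilt SQL could look like for the three sample cities from the question (illustrative only; your application code would generate one branch per distinct prefix):

select * from cities gc2
where (gc2.city_tsvector @@ to_tsquery('SAULXURES')
or gc2.city_tsvector @@ to_tsquery('OBERHOFFEN&SUR&MODER'))
and gc2.minipostcode = '67'
union
select * from cities gc2
where gc2.city_tsvector @@ to_tsquery('ARGENTEUIL')
and gc2.minipostcode = '95';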
Try this, and post the new execution plan.
Upvotes: 1