frinux

Reputation: 2092

Perform multiple PgSQL full-text searches in one query

I'm performing full-text search in PgSQL with queries like this (the aim is to return a list of geocoded cities matching a searched city name):

SELECT * 
FROM   cities 
WHERE  city_tsvector @@ to_tsquery('Paris') 
AND    postcode LIKE '75%'

This query runs quite fast on my database (425,626 rows in the cities table): around 100 ms. So far, so good.
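For reference, a `@@` match is only this fast when a GIN index backs the tsvector column; a minimal sketch of the assumed setup (the index is not shown in the question):

```sql
-- Assumed setup: a GIN index backing the @@ full-text matches.
-- Without it, each search degenerates into a sequential scan.
CREATE INDEX IF NOT EXISTS ix_cities_city_tsvector
    ON cities USING gin (city_tsvector);
```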

Now I have to perform this search on 400 cities at the same time.

400 x 100ms = 40 seconds, which is way too long for my users.

I'm trying to write a single query to perform this search in one go. One detail: the cities I must search for are not stored in the database.

So I wrote this kind of query:

SELECT DISTINCT ON (myid) *
FROM unnest(
    array[19977,19978,19979, (and so on)]::int[],
    array['SAULXURES','ARGENTEUIL','OBERHOFFEN&SUR&MODER', (and so on)]::text[],
    array['67','95','67','44', (and so on)]::text[]
) AS t(myid, cityname,mypostcode)
 LEFT JOIN cities gc2 ON gc2.city_tsvector @@ to_tsquery(cityname) AND gc2.postcode LIKE CONCAT(mypostcode,'%')
 ORDER BY myid
;

The result is just catastrophic: for the same searches, the query is four times slower!

Is it possible to perform this kind of query so that the execution takes less time?

Thanks

EDIT

Here is the structure of the cities table (425,626 rows):

[screenshot: table structure]

EDIT with @The-Impaler's answer:

query takes 11 seconds

EXPLAIN VERBOSE :


    Unique  (cost=71133.21..71138.53 rows=100 width=40)
      Output: t.myid, (st_astext(gc2.gps_coordinates))
      ->  Sort  (cost=71133.21..71135.87 rows=1064 width=40)
            Output: t.myid, (st_astext(gc2.gps_coordinates))
            Sort Key: t.myid
            ->  Hash Right Join  (cost=2.26..71079.72 rows=1064 width=40)
                  Output: t.myid, st_astext(gc2.gps_coordinates)
                  Hash Cond: (left((gc2.postcode)::text, 2) = t.mypostcode)
                  Join Filter: (gc2.city_tsvector @@ to_tsquery(t.cityname))
                  ->  Seq Scan on public.geo_cities gc2  (cost=0.00..13083.26 rows=425626 width=69)
                        Output: gc2.id, gc2.country_code, gc2.city, gc2.postcode, gc2.gps_coordinates, gc2.administrative_level_1_name, gc2.administrative_level_1_code, gc2.administrative_level_2_name, gc2.administrative_level_2_code, gc2.administrative_level_ (...)
                  ->  Hash  (cost=1.01..1.01 rows=100 width=72)
                        Output: t.myid, t.cityname, t.mypostcode
                        ->  Function Scan on t  (cost=0.01..1.01 rows=100 width=72)
                              Output: t.myid, t.cityname, t.mypostcode
                              Function Call: unnest('{289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,28914887922 (...)

query takes 12 seconds

EXPLAIN VERBOSE:


    Unique  (cost=71665.25..71670.57 rows=100 width=40)
      Output: t.myid, (st_astext(gc2.gps_coordinates))
      ->  Sort  (cost=71665.25..71667.91 rows=1064 width=40)
            Output: t.myid, (st_astext(gc2.gps_coordinates))
            Sort Key: t.myid
            ->  Hash Right Join  (cost=2.26..71611.75 rows=1064 width=40)
                  Output: t.myid, st_astext(gc2.gps_coordinates)
                  Hash Cond: ((substring((gc2.postcode)::text, 1, 2))::text = t.mypostcode)
                  Join Filter: (gc2.city_tsvector @@ to_tsquery(t.cityname))
                  ->  Seq Scan on public.geo_cities gc2  (cost=0.00..13083.26 rows=425626 width=69)
                        Output: gc2.id, gc2.country_code, gc2.city, gc2.postcode, gc2.gps_coordinates, gc2.administrative_level_1_name, gc2.administrative_level_1_code, gc2.administrative_level_2_name, gc2.administrative_level_2_code, gc2.administrative_level_ (...)
                  ->  Hash  (cost=1.01..1.01 rows=100 width=72)
                        Output: t.myid, t.cityname, t.mypostcode
                        ->  Function Scan on t  (cost=0.01..1.01 rows=100 width=72)
                              Output: t.myid, t.cityname, t.mypostcode
                              Function Call: unnest('{289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,28914887922 (...)

Upvotes: 0

Views: 158

Answers (1)

The Impaler

Reputation: 48770

PART ONE - Enforcing Index Range Scan on postcode.

The key optimization is to make each search work on about 4,000 rows instead of 400k rows. This should be easy if the filtering by the first two digits of the postcode is done appropriately.

Option #1: Create an index on the first two characters of postcode:

create index ix1 on cities (substring(postcode from 1 for 2));
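One caveat worth noting (an assumption about the planner, not stated in the original answer): PostgreSQL only considers an expression index when the query repeats the exact indexed expression, so the `LIKE '75%'` filter would need to be rewritten to match it, e.g.:

```sql
-- The filter must use the same expression the index was built on,
-- otherwise the planner falls back to a sequential scan.
SELECT *
FROM   cities
WHERE  city_tsvector @@ to_tsquery('Paris')
AND    substring(postcode from 1 for 2) = '75';
```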

Option #2: Use = instead of LIKE when filtering by postcode. You'll need to create a pseudo-column minipostcode and an index on it:

create or replace function minipostcode(cities)
returns char(2) as $$
  select substring($1.postcode from 1 for 2)
$$ stable language sql;

create index ix_cities_minipostcode on cities(minipostcode(cities));

And your SQL will change to:

SELECT DISTINCT ON (myid) *
FROM unnest(
    array[19977,19978,19979, (and so on)]::int[],
    array['SAULXURES','ARGENTEUIL','OBERHOFFEN&SUR&MODER', (and so on)]::text[],
    array['67','95','67','44', (and so on)]::text[]
) AS t(myid, cityname,mypostcode)
 LEFT JOIN cities gc2 ON gc2.city_tsvector @@ to_tsquery(cityname) 
       AND gc2.minipostcode = mypostcode
 ORDER BY myid
;

See the gc2.minipostcode = there?

Check the execution plan

Get the execution plan for each one of the options above and compare using:

explain verbose <my-query>

Please post the execution plan of both options.

PART TWO - Reducing the number of scans.

Once you are sure it's using Index Range Scan on postcode you can further optimize it.

Considering your search has 400 values, each two-digit postcode prefix is probably repeated around 4 times. So do a single Index Range Scan per postcode instead of per city: you'll cut your execution time by roughly 75% just by doing this.

However, you can't do this using pure SQL; you'll need to pre-process your search list, generating a single query per postcode. You won't use unnest anymore; instead you'll prebuild the SQL in your application language.

For example, since '67' shows up twice in your example, those two city searches should be combined into a single Index Range Scan, resulting in something like:

select * from cities gc2 
  where (gc2.city_tsvector @@ to_tsquery('SAULXURES') 
     or gc2.city_tsvector @@ to_tsquery('OBERHOFFEN&SUR&MODER')) 
    and gc2.minipostcode = '67'
union
(next select for another postcode here, and so on...)

This is much more efficient than your unnest option, since it executes at most 100 Index Range Scans (one per two-digit postcode prefix), even if you increase your search to 1000 or 5000 criteria.
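The pre-processing step described above could be sketched in the application layer like this (a hypothetical Python helper, `build_batched_query`, assuming the search criteria arrive as `(id, cityname, postcode_prefix)` tuples and the `minipostcode` pseudo-column from Option #2 exists):

```python
from collections import defaultdict

def build_batched_query(searches):
    """Group search tuples by postcode prefix and emit one SELECT
    per prefix, combined with UNION, as suggested in the answer."""
    by_postcode = defaultdict(list)
    for _myid, cityname, prefix in searches:
        by_postcode[prefix].append(cityname)

    selects = []
    for prefix, names in sorted(by_postcode.items()):
        # All city names sharing a prefix are OR-ed together inside
        # a single Index Range Scan on that prefix.
        ts_conditions = " or ".join(
            f"gc2.city_tsvector @@ to_tsquery('{name}')" for name in names
        )
        selects.append(
            f"select * from cities gc2\n"
            f"  where ({ts_conditions})\n"
            f"    and gc2.minipostcode = '{prefix}'"
        )
    return "\nunion\n".join(selects)
```

In real code you would pass the city names and prefixes as bound parameters rather than interpolating them into the SQL string, both for safety and to let the server cache the plan.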

Try this, and post the new execution plan.

Upvotes: 1
