Florian Seliger
Florian Seliger

Reputation: 441

Matching on postal code and city name - very slow in PostgreSQL

I am trying to update address fields in mytable with data from othertable. If I match on postal codes and search for city names from othertable in mytable, it works reasonably fast. But as I don't have postal codes in all cases, I also want to look for names only in a 2nd query. This takes hours (>12h). Any ideas what I can do to speed up the query? Please note that indexing did not help. Index scans in (2) weren't faster.

Code for matching on postal code + name (1)

update mytable t1 set
admin1 = t.admin1,
admin2 = t.admin2,
admin3 = t.admin3,
postal_code = t.postal_code,
lat = t.lat,
lng = t.lng from (
select * from othertable) t
where t.postal_code = t1.postal_code and     t1.country = t.country
and upper(t1.address) like '%' || t.admin1 || '%' --looks whether city name from othertable shows up in address in t1
and admin1 is null;

Code for matching on name only (2)

update mytable t1 set
admin1 = t.admin1,
admin2 = t.admin2,
admin3 = t.admin3,
postal_code = t.postal_code,
lat = t.lat,
lng = t.lng from (
select * from othertable) t
where t1.country = t.country
and upper(t1.address) like '%' || t.admin1 || '%' --looks whether city name from othertable shows up in address in t1
and admin1 is null;

Query plan 1:

"Update on mytable t1           (cost=19084169.53..19205622.16 rows=13781     width=1918)"
"  ->  Merge Join  (cost=19084169.53..19205622.16 rows=13781 width=1918)"
"        Merge Cond: (((t1.postal_code)::text = (othertable.postal_code)::text) AND (t1.country = othertable.country))"
"        Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
"        ->  Sort  (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
"              Sort Key: t1.postal_code, t1.country"
"              ->  Seq Scan on mytable t1  (cost=0.00..4057214.31 rows=6270570 width=1661)"
"                    Filter: (admin1 IS NULL)"
"        ->  Materialize  (cost=752152.19..766803.71 rows=2930305 width=92)"
"              ->  Sort  (cost=752152.19..759477.95 rows=2930305 width=92)"
"                    Sort Key: othertable.postal_code, othertable.country"
"                    ->  Seq Scan on othertable  (cost=0.00..136924.05 rows=2930305 width=92)"

Query plan 2:

"Update on mytable t1     (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
"  ->  Merge Join  (cost=19084169.53..27246633167.33 rows=5464884210 width=1918)"
"        Merge Cond: (t1.country = othertable.country)"
"        Join Filter: (upper((t1.address)::text) ~~ (('%'::text || othertable.admin1) || '%'::text))"
"        ->  Sort  (cost=18332017.34..18347693.77 rows=6270570 width=1661)"
"              Sort Key: t1.country"
"              ->  Seq Scan on mytable t1  (cost=0.00..4057214.31 rows=6270570 width=1661)"
"                    Filter: (admin1 IS NULL)"
"        ->  Materialize  (cost=752152.19..766803.71 rows=2930305 width=92)"
"              ->  Sort  (cost=752152.19..759477.95 rows=2930305 width=92)"
"                    Sort Key: othertable.country"
"                    ->  Seq Scan on othertable (cost=0.00..136924.05 rows=2930305 width=92)"

Upvotes: 0

Views: 220

Answers (1)

JGH
JGH

Reputation: 17906

In the second query, you are joining (more or less) on city name, but the othertable has several entries per city name, so you are updating mytable several times per record, with unpredictable value (which lat-long or other admin2/3 will be the last one to be updated?)

If othertable has entries without postal code, use them by adding an extra condition AND othertable.posalcode is null

Else, you will want to get a subset of othertable that returns one row per admin1 + country value. You would replace select * from othertable by the following query. Of course you might want to adjust it to get another lat/long/admin2-3 than the 1st one..

SELECT admin1, country, first(postal_code) postal_code, first(lat) lat, first(lng) lng, first(admin2) admin2, first(admin3) admin3
FROM  othertable 
GROUP BY admin1,country

Worst, the second query overwrite what was updated in the 1st query, so you must ignore these records by adding and mytable.postalcode is null

The entire query could be

UPDATE mytable t1 
SET
    admin1 = t.admin1,
    admin2 = t.admin2,
    admin3 = t.admin3,
    postal_code = t.postal_code,
    lat = t.lat,
    lng = t.lng 
FROM (
    SELECT admin1, country, first(postal_code) postal_code, first(lat) lat, first(lng) lng, first(admin2) admin2, first(admin3) admin3
    FROM  othertable 
    GROUP BY admin1,country) t
WHERE t1.country = t.country
AND upper(t1.address) like '%' || t.admin1 || '%' --looks whether city name from othertable shows up in address in t1
AND admin1 is null
AND mytable.postal_code is null;

Upvotes: 1

Related Questions