Reputation: 11
My problem is that I have a very slow update query on a table with 14 million rows. I tried different things to tune my server which brought good performance but not for update queries.
I have two tables:
Here is my query and its output:
explain analyse
update T2
set vid=T1.vid
from T1
where stxt2 ~ stxt1 and T2.vid = 0;
Update on T2 (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=25141785.741..25141785.741 rows=0 loops=1) -> Nested Loop (cost=0.00..9037530.59 rows=2814247 width=131) (actual time=32.636..25035782.995 rows=679354 loops=1) Join Filter: ((T2.stxt2)::text ~ (T1.stxt1)::text) -> Seq Scan on T2 (cost=0.00..594772.96 rows=1061980 width=121) (actual time=0.067..5402.614 rows=1037809 loops=1) Filter: (vid= 1) -> Materialize (cost=0.00..17.95 rows=530 width=34) (actual time=0.000..0.069 rows=530 loops=1037809) -> Seq Scan on T1 (cost=0.00..15.30 rows=530 width=34) (actual time=0.019..0.397 rows=530 loops=1) Total runtime: 25141785.904 ms
As you can see the query took approximately 25141 seconds (~ 7 hours). f I understood well, the planner estimates the execution time to be 9037 seconds (~ 2.5 hours). Am I missing something here?
Here are information about my server config:
I have run vacuum full and analyse several times on table T2 but this still does not improve much the situation.
PS: if I set full_page_writes to off, this improves considerably update queries, but I don't want to risk data loss. Do you please have any recommandations?
Upvotes: 1
Views: 3175
Reputation: 8900
Here's an expanded example of my previous comment on a functional index. If you use postgresql and don't know what a functional index is, you're probably suffering because of it.
Let's create a test table put some data in it:
smarlowe=# create table test (a text, b text, c int);
smarlowe=# insert into test select 'abc','z',0 from generate_series(1,1000000); -- 1 million rows that don't match
smarlowe=# insert into test select 'abc','a',0 from generate_series(1,10); -- 10 rows that do match
smarlowe=# insert into test select 'abc','z',1 from generate_series(1,1000000); -- another million rows that won't match.
Now we want to run some queries on it to test:
\timing
select * from test where a ~ b and c=0; -- ignore how long this takes
select * from test where a ~ b and c=0; -- run it twice to get a read with cached data.
On my laptop this takes ~750ms. This classic index on c:
smarlowe=# create index test_c on test(c);
smarlowe=# select * from test where a ~ b and c=0;
takes ~400ms on my laptop.
This functional index tho:
smarlowe=# drop index test_c ;
smarlowe=# create index test_regex on test (c) where (a~b);
smarlowe=# select * from test where a ~ b and c=0;
Now runs in 1.3ms.
Of course there's no such thing as a free lunch, and you WILL pay for this index during updates / inserts.
Upvotes: 1
Reputation: 11
Thank you, this brings some help. So here's what I did:
update urls set hostname=full_url where full_url not like '%/%' and full_url not like 'www\.%';
Time: 112435.192 ms
Then I run this query:
mydb=> explain analyse update urls set vid=vid from T1 where hostname=stxt1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Update on urls (cost=21.93..37758.76 rows=864449 width=124) (actual time=767.793..767.793 rows=0 loops=1)
-> Hash Join (cost=21.93..37758.76 rows=864449 width=124) (actual time=102.324..430.448 rows=94934 loops=1)
Hash Cond: ((urls.hostname)::text = (T1.stxt1)::text)
-> Seq Scan on urls (cost=0.00..25612.52 rows=927952 width=114) (actual time=0.009..265.962 rows=927952 loops=1)
-> Hash (cost=15.30..15.30 rows=530 width=34) (actual time=0.444..0.444 rows=530 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 35kB
-> Seq Scan on T1 (cost=0.00..15.30 rows=530 width=34) (actual time=0.002..0.181 rows=530 loops=1)
Total runtime: 767.860 ms
I was really surprised by the total runtime! less than 1 sec which confirms what you said about updates with exact matches. Now I searched for exacts matches between xtxt1 and stxt2 this way:
mydb=> select count(*) from T2 where vid is null and exists(select null from T1 where stxt1=stxt2);
count
--------
308486
(1 row)
Thus I tried the update on T2 table, and got this:
mydb=> explain analyse update T2 set vid = T1.vid from T1 where T2.vid is null and stxt2=stxt1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Update on T2 (cost=21.93..492023.13 rows=2106020 width=131) (actual time=252395.118..252395.118 rows=0 loops=1)
-> Hash Join (cost=21.93..492023.13 rows=2106020 width=131) (actual time=1207.897..4739.515 rows=308486 loops=1)
Hash Cond: ((T2.stxt2)::text = (T1.stxt1)::text)
-> Seq Scan on T2 (cost=0.00..455452.09 rows=4130377 width=121) (actual time=158.773..3915.379 rows=4103865 loops=1)
Filter: (vid IS NULL)
-> Hash (cost=15.30..15.30 rows=530 width=34) (actual time=0.293..0.293 rows=530 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 35kB
-> Seq Scan on T1 (cost=0.00..15.30 rows=530 width=34) (actual time=0.005..0.121 rows=530 loops=1)
Total runtime: 252395.204 ms
(9 rows)
Time: 255389.704 ms
Actually 255 sec seems to be a very good time for such a query. I'll try to extract the hostname part from all urls and make the update. I still should make sure that updating with exact matches is fast cause I had bad experience with it.
Thank you for your support.
Upvotes: 0
Reputation: 44250
This is not a solution, but a data-modelling work-around
The update could probably take a few minutes.
SET search_path='tmp';
DROP TABLE urls CASCADE;
CREATE TABLE urls
( id SERIAL NOT NULL PRIMARY KEY
, full_url varchar
, proto varchar
, hostname varchar
, pathname varchar
);
INSERT INTO urls(full_url) VALUES
( 'ftp://www.myhost.com/secret.tgz' )
,( 'http://www.myhost.com/robots.txt' )
,( 'http://www.myhost.com/index.php' )
,( 'https://www.myhost.com/index.php' )
,( 'http://www.myhost.com/subdir/index.php' )
,( 'https://www.myhost.com/subdir/index.php' )
,( 'http://www.hishost.com/index.php' )
,( 'https://www.hishost.com/index.php' )
,( 'http://www.herhost.com/index.php' )
,( 'https://www.herhost.com/index.php' )
;
UPDATE urls
SET proto = split_part(full_url, '://' , 1)
, hostname = split_part(full_url, '://' , 2)
;
UPDATE urls
SET pathname = substr(hostname, 1+strpos(hostname, '/' ))
, hostname = split_part(hostname, '/' , 1)
;
-- the full_url field is now redundant: we can drop it
ALTER TABLE urls
DROP column full_url
;
-- and we could always reconstruct the full_url from its components.
CREATE VIEW vurls AS (
SELECT id
, proto || '://' || hostname || '/' || pathname AS full_url
, proto
, hostname
, pathname
FROM urls
);
SELECT * FROM urls;
;
SELECT * FROM vurls;
;
OUTPUT:
INSERT 0 10
UPDATE 10
UPDATE 10
ALTER TABLE
CREATE VIEW
id | proto | hostname | pathname
----+-------+-----------------+------------------
1 | ftp | www.myhost.com | secret.tgz
2 | http | www.myhost.com | robots.txt
3 | http | www.myhost.com | index.php
4 | https | www.myhost.com | index.php
5 | http | www.myhost.com | subdir/index.php
6 | https | www.myhost.com | subdir/index.php
7 | http | www.hishost.com | index.php
8 | https | www.hishost.com | index.php
9 | http | www.herhost.com | index.php
10 | https | www.herhost.com | index.php
(10 rows)
id | full_url | proto | hostname | pathname
----+-----------------------------------------+-------+-----------------+------------------
1 | ftp://www.myhost.com/secret.tgz | ftp | www.myhost.com | secret.tgz
2 | http://www.myhost.com/robots.txt | http | www.myhost.com | robots.txt
3 | http://www.myhost.com/index.php | http | www.myhost.com | index.php
4 | https://www.myhost.com/index.php | https | www.myhost.com | index.php
5 | http://www.myhost.com/subdir/index.php | http | www.myhost.com | subdir/index.php
6 | https://www.myhost.com/subdir/index.php | https | www.myhost.com | subdir/index.php
7 | http://www.hishost.com/index.php | http | www.hishost.com | index.php
8 | https://www.hishost.com/index.php | https | www.hishost.com | index.php
9 | http://www.herhost.com/index.php | http | www.herhost.com | index.php
10 | https://www.herhost.com/index.php | https | www.herhost.com | index.php
(10 rows)
Upvotes: 3