Theo F

Reputation: 1355

PostgreSQL: UPDATE large table

I have a large PostgreSQL table of 29 million rows. Its size, according to the stats tab in pgAdmin, is almost 9 GB. The table is PostGIS-enabled, with an empty geometry column.

I want to UPDATE the geometry column using ST_GeomFromText, reading from the X and Y coordinate columns (SRID: 27700) stored in the same table. However, running this query on the whole table at once results in 'out of disk space' and, less frequently, 'connection to server lost' errors.

To overcome this, should I UPDATE the 29 million rows in batches/stages? How can I update the first 1 million rows, then the next 1 million, and so on until I reach 29 million? (One possible pattern is sketched after the query below.)

Or are there other more efficient ways of updating large tables like this?

I should add that the table is hosted on AWS.

My UPDATE query is:

UPDATE schema.table
SET geom = ST_GeomFromText('POINT(' || eastingcolumn || ' ' || northingcolumn || ')',27700);
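
A minimal sketch of one such batching pattern, assuming unprocessed rows can be identified by geom IS NULL and using ctid so no primary key is needed (the 1,000,000 batch size is arbitrary):

-- Update at most 1,000,000 not-yet-populated rows per pass
UPDATE schema.table t
SET geom = ST_GeomFromText('POINT(' || eastingcolumn || ' ' || northingcolumn || ')', 27700)
WHERE t.ctid IN (
    SELECT ctid
    FROM schema.table
    WHERE geom IS NULL
    LIMIT 1000000
);
-- Re-run the statement, one transaction per pass, until it reports UPDATE 0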

Upvotes: 5

Views: 12202

Answers (2)

Theo F

Reputation: 1355

From my original question:

However, running this query on the whole table at once results in 'out of disk space' and, less frequently, 'connection to server lost' errors.

It turns out our AWS database instance was running out of disk space, which stopped my original ST_GeomFromText query from completing. Freeing up space fixed it.
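
For reference, the table size can also be checked from SQL rather than pgAdmin's stats tab:

-- Total on-disk size of the table, including indexes and TOAST
SELECT pg_size_pretty(pg_total_relation_size('schema.tablename'));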

On an important note, as suggested by @mlinth, ST_Point ran my query far quicker than ST_GeomFromText (24 minutes vs. 2 hours).

My final query was:

UPDATE schema.tablename
SET geom = ST_SetSRID(ST_Point(eastingcolumn,northingcolumn),27700);
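
A quick spot check that the column was populated with the expected SRID (a minimal example against the same table):

-- Inspect a few converted rows and their SRID
SELECT ST_AsText(geom), ST_SRID(geom)
FROM schema.tablename
LIMIT 5;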

Upvotes: 4

filiprem

Reputation: 7124

You did not give any server specs; writing 9 GB can be pretty fast on recent hardware. Bear in mind, though, that because of MVCC an UPDATE writes a new version of every affected row, so a full-table update temporarily needs roughly another 9 GB (plus WAL) until VACUUM reclaims the old row versions, which would explain the 'out of disk space' errors.

You should be OK with one long UPDATE, unless you have concurrent writes to this table.

A common trick to overcome this problem (a very long transaction that locks writes to the table) is to split the UPDATE into ranges based on the primary key, run in separate transactions.

/* Use the PK or any attribute with a known distribution pattern; */
/* each statement runs in its own transaction (SET clause filled in from the question) */
UPDATE schema.table SET geom = ST_SetSRID(ST_Point(eastingcolumn, northingcolumn), 27700) WHERE id BETWEEN 0 AND 1000000;
UPDATE schema.table SET geom = ST_SetSRID(ST_Point(eastingcolumn, northingcolumn), 27700) WHERE id BETWEEN 1000001 AND 2000000;
-- ...continue through the remaining ranges up to 29 million

For a high level of concurrent writes, people use more subtle tricks (SELECT ... FOR UPDATE with NOWAIT, lightweight locks, retry logic, etc.).
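
As a sketch of that idea, assuming PostgreSQL 9.5+ (for SKIP LOCKED, a variant of NOWAIT that skips contested rows instead of raising an error) and the id key and geom IS NULL guard used above:

-- Claim up to 10,000 unlocked, unprocessed rows and convert them;
-- rows locked by concurrent writers are skipped, not waited on
WITH batch AS (
    SELECT id
    FROM schema.table
    WHERE geom IS NULL
    ORDER BY id
    LIMIT 10000
    FOR UPDATE SKIP LOCKED
)
UPDATE schema.table t
SET geom = ST_SetSRID(ST_Point(t.eastingcolumn, t.northingcolumn), 27700)
FROM batch
WHERE t.id = batch.id;
-- Repeat (with retry logic around it) until the update reports UPDATE 0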

Upvotes: 9
