Reputation: 165

Adding new columns to a column database having billions of rows

I want to add new columns to a table that already contains billions of rows. The new columns are derived from existing columns.

For example,

new_col1 = old_col1 + old_col2
new_col2 = old_col1 / old_col2

I am trying to do this in the following way:

1. Add the new columns:

ALTER TABLE table_name
ADD (   column_1    column_definition,
        column_2    column_definition,
        ...
        column_n    column_definition )

2. Read the rows one by one from the table and fill in the values for the new columns.

There is no primary key in the database, so I cannot refer to an individual row. To read the rows one by one, I would have to do a SELECT *, which would return a huge result set (billions of records).

Is there any better way to do this?

Upvotes: 3

Answers (3)

Chris Diver

Reputation: 19842

Different DBMSs have different SQL dialects, so it is useful to specify which one you are using in the question.

In SQL Server you could use a computed column, but this would calculate the result every time you select the data. You could flag it as persisted, but it may take a while to make the change. You can't use a computed column, though, if you are going to remove the old columns.

Alternatively, create the new column allowing NULLs and then update it in batches, repeating the statement until it affects no rows:

UPDATE TOP (1000) table_name SET new_col1 = old_col1 + old_col2 WHERE new_col1 IS NULL

Again, the query is for SQL Server, but there will be alternatives for your DBMS.
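As a rough sketch of how the batch statement might be driven to completion, the loop below repeats a small UPDATE until no rows are affected. It uses Python's sqlite3 module with an in-memory SQLite table as a stand-in, not SQL Server. SQLite's implicit rowid and the `rowid IN (... LIMIT ?)` subquery are assumptions that play the role of `UPDATE TOP (n)`, and the helper name `backfill_in_batches` is made up for illustration. Rows whose source columns are NULL are excluded, because their result would stay NULL and the loop would otherwise never finish.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Fill new_col1 in small batches; return the number of batches that did work.

    Hypothetical helper: table and column names follow the question.
    """
    batches = 0
    while True:
        cur = conn.execute(
            """
            UPDATE table_name
               SET new_col1 = old_col1 + old_col2
             WHERE rowid IN (
                   SELECT rowid
                     FROM table_name
                    WHERE new_col1 IS NULL
                      AND old_col1 IS NOT NULL
                      AND old_col2 IS NOT NULL
                    LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()  # commit per batch so each transaction stays small
        if cur.rowcount == 0:
            break  # nothing left to fill
        batches += 1
    return batches


# Small demo on an in-memory database (stand-in for the real table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (old_col1 INTEGER, old_col2 INTEGER)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [(i, 2 * i) for i in range(2500)],
)
conn.execute("ALTER TABLE table_name ADD COLUMN new_col1 INTEGER")

batches = backfill_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM table_name WHERE new_col1 IS NULL"
).fetchone()[0]
```

In the demo, 2,500 rows with a batch size of 1,000 take three batches. On a real system the batch size would need tuning against lock duration and log growth.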

Also read Mr Hooper's comment about adding an index on the new column, so that the performance of the UPDATE doesn't get worse as more rows are updated. The update is a read and write operation: the index will speed up the reads and slightly slow the writes (maintaining the index), but it should be worthwhile.

Upvotes: 2

Brian Hooper

Reputation: 22084

I think Mr Diver's method would be fine if you also added an index on one of your new columns; otherwise, as the job progresses, it will have to do more and more scanning to find the rows it hasn't already updated. Adding an index means it doesn't have to do that. A possible drawback is that the index's selectivity will be frightful when the column is first created (every value is NULL), but I don't think that would be a problem, as you only care about NULL or NOT NULL. You could drop the index when the update is complete.
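The create, backfill, drop sequence described here might look roughly like the following. This is only a sketch against an in-memory SQLite table via Python's sqlite3 module, not SQL Server; the index name `idx_new_col1` and the batch size of 100 are made up for illustration. It also assumes the DBMS stores NULLs in the index (SQLite and SQL Server do; Oracle B-tree indexes, for example, do not index all-NULL keys), and that the source columns contain no NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (old_col1 INTEGER, old_col2 INTEGER, new_col1 INTEGER)"
)
conn.executemany(
    "INSERT INTO table_name (old_col1, old_col2) VALUES (?, ?)",
    [(i, i + 1) for i in range(500)],
)

# Temporary index so that finding the remaining NULL rows does not
# require scanning past rows that were already updated.
conn.execute("CREATE INDEX idx_new_col1 ON table_name (new_col1)")

updated = 0
while True:
    cur = conn.execute(
        """
        UPDATE table_name
           SET new_col1 = old_col1 + old_col2
         WHERE rowid IN (
               SELECT rowid FROM table_name
                WHERE new_col1 IS NULL
                LIMIT 100)
        """
    )
    conn.commit()
    if cur.rowcount == 0:
        break  # every row has been filled
    updated += cur.rowcount

# The index was only needed for the backfill, so drop it afterwards.
conn.execute("DROP INDEX idx_new_col1")
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```

Whether the index pays for its maintenance cost depends on the DBMS and the table, so it is worth measuring on a copy of the data first.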

Upvotes: 1

Flakron Bytyqi

Reputation: 3254

Use a stored procedure that updates the rows in batches of 100, and schedule it as a job to run every 30 seconds or so.

Upvotes: 0
