Garvit Gupta

Reputation: 11

Fastest way to compare multiple column values in SQL Server?

I have a SQL Server table of 200 million records that exists on two different servers. I need to move this table from Server 1 to Server 2.

The table on Server 1 can be a subset or a superset of the table on Server 2. Some of the records (around 1 million) on Server 1 have been updated, and I need to apply those updates on Server 2. So currently I am following this approach:

1) Use SSIS to move the data from Server 1 to a staging database on Server 2.

2) Then compare the data in staging with the table on Server 2, column by column. If any of the columns differ, I update the whole row.

This is taking a lot of time. I tried using HASHBYTES in order to compare rows, like this:

    HASHBYTES('sha', CONCAT(a.[account_no], a.[transaction_id], ...))
    <>
    HASHBYTES('sha', CONCAT(b.[account_no], b.[transaction_id], ...))

But this is taking even more time.
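For reference, the overall shape of the comparison is roughly this (the table names and the key column here are placeholders, and the real column list is much longer):

    UPDATE b
    SET b.[account_no]     = a.[account_no],
        b.[transaction_id] = a.[transaction_id]
        -- ...and so on for the remaining columns
    FROM staging.dbo.transactions a
    JOIN dbo.transactions b
        ON b.[id] = a.[id]  -- placeholder key column
    WHERE HASHBYTES('sha', CONCAT(a.[account_no], a.[transaction_id]))
       <> HASHBYTES('sha', CONCAT(b.[account_no], b.[transaction_id]));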

Is there any other approach that would be faster and save time?

Upvotes: 1

Views: 2461

Answers (1)

GKlesczewski

Reputation: 302

This is a pretty common problem.

First - do not try to do the updates directly in SQL - the performance will be terrible, and it will bring the database server to its knees.

For context, TS1 will be the table on Server 1 and TS2 will be the table on Server 2. Using SSIS, create two steps within the job:

1) Find the deletes: scan TS2 by ID, and delete any TS2 row whose ID does not exist in TS1.

2) Scan TS1. If the ID exists in TS2, update that record; if memory serves, SSIS can inspect for differences and only update if needed, otherwise just execute the update statement. If the ID does not exist in TS2, insert the record.
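To make the logic concrete, here is a rough T-SQL sketch of those two steps, with placeholder table and column names (per the caution above, you would drive this through SSIS rather than running it as single set-based statements against 200 million rows):

    -- dbo.TS1 here stands for the staged copy of the Server 1 table
    -- Step 1: remove rows from TS2 whose ID no longer exists in TS1
    DELETE t2
    FROM dbo.TS2 AS t2
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TS1 AS t1 WHERE t1.id = t2.id);

    -- Step 2a: for IDs in both tables, update TS2 from TS1
    -- (SSIS can skip rows that are identical; plain SQL as written updates every match)
    UPDATE t2
    SET t2.col1 = t1.col1,
        t2.col2 = t1.col2  -- extend with the remaining columns
    FROM dbo.TS2 AS t2
    JOIN dbo.TS1 AS t1 ON t1.id = t2.id;

    -- Step 2b: for IDs only in TS1, insert into TS2
    INSERT INTO dbo.TS2 (id, col1, col2)
    SELECT t1.id, t1.col1, t1.col2
    FROM dbo.TS1 AS t1
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TS2 AS t2 WHERE t2.id = t1.id);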

I can't speak to performance on this due to variations in schemas and servers, but analyzing the 200 million records will be compute-intensive. It WILL take a long time.

For ongoing execution, you will need to add a "last modified date" timestamp to each record and a trigger to update the field on any legitimate change, then use that to filter down your problem space. The first scan (the deletes) will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last-modified-date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index on that column to aid the filtering.
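A minimal sketch of that setup, again with placeholder table, column, and key names:

    -- Add the audit column, with a default to backfill existing rows
    ALTER TABLE dbo.TS1
        ADD last_modified DATETIME2 NOT NULL
        CONSTRAINT DF_TS1_last_modified DEFAULT SYSUTCDATETIME();
    GO

    -- Keep the column current on any legitimate change
    CREATE TRIGGER trg_TS1_last_modified
    ON dbo.TS1
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE t
        SET last_modified = SYSUTCDATETIME()
        FROM dbo.TS1 AS t
        JOIN inserted AS i ON i.id = t.id;  -- placeholder key column
    END;
    GO

    -- Index the column so the incremental filter is cheap
    CREATE INDEX IX_TS1_last_modified ON dbo.TS1 (last_modified);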

The other option is to perform a burn and load each time: disable any constraints around TS2, truncate TS2, and copy the data into TS2 from TS1, finally re-enabling the constraints and rebuilding any indexes.
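Sketched out, that looks something like this, again with placeholder names (note that if other tables reference TS2 with foreign keys, TRUNCATE is blocked even when those constraints are disabled, so they would have to be dropped and recreated instead):

    -- Disable check/foreign key constraints on the target
    ALTER TABLE dbo.TS2 NOCHECK CONSTRAINT ALL;

    -- Wipe and reload; TABLOCK allows minimal logging under the
    -- simple or bulk-logged recovery models
    TRUNCATE TABLE dbo.TS2;
    INSERT INTO dbo.TS2 WITH (TABLOCK) (id, col1, col2)
    SELECT id, col1, col2
    FROM dbo.TS1;  -- the staged copy of the Server 1 table

    -- Re-enable constraints (WITH CHECK revalidates the data) and rebuild indexes
    ALTER TABLE dbo.TS2 WITH CHECK CHECK CONSTRAINT ALL;
    ALTER INDEX ALL ON dbo.TS2 REBUILD;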

Best of luck to you.

Upvotes: 1
