Compare huge table(100 million rows) data between Oracle and SQL Server

Question

I have a process which populates an Oracle table which has over 100 million rows. The table structure is as follows

**ORACLE_TABLE**
id|contractdatetime|Attr3|Attr4|Attr5

The combination of (id,contractdatetime) is unique in this table, which gets populated using an external process.

The total distinct id is approx 30000 only. For every id there is a unique contractdatetime. The id isn't unique, but the combination of (id,contractdatetime) is

Now another process populates an identical table in SQL Server

**SQLSERVER_TABLE**
id|contractdatetime|Attr3|Attr4|Attr5

I am thinking about the best way to check if the data is both tables is identical. I thought if I can get a hashed version by contractid and somehow aggregate all the other attributes in Oracle. And if I can do the same in SQL Server, I would be able to compare this in Excel itself (30000) rows.

I have searched StackOverflow and couldn't get an identical function for MD5_XOR or anything which can help achieve this as per the link below. http://www.db-nemec.com/MD5/CompareTablesUsingMD5Hash.html

The other options of using a linked server etc would take lot more difficulty in getting approvals for doing this.

Is there a good way to go about with this?

Jon Heller · Accepted Answer

For a fast, high-level comparison between an Oracle and a SQL Server table you can use aggregations of the functions STANDARD_HASH and HASH_BYTES.

Oracle Code

--Create a simple table.
create table table1
(
    id number,
    contractdatetime date,
    Attr3 varchar2(100),
    Attr4 varchar2(100),
    Attr5 varchar2(100)
);

--Insert 4 rows, the first three will be identical between databases,
--the last row will be different.
insert into table1 values (1, date '2000-01-01', 'a', 'a', 'a');
insert into table1 values (2, date '2000-01-01', 'b', 'b', 'b');
insert into table1 values (2, date '2000-01-02', null, null, null);
insert into table1 values (3, date '2000-01-02', 'Oracle', 'Oracle', 'Oracle');
commit;

select
    id,
    --Format the number
    trim(to_number(
        --Sum per group.
        sum(
            --Convert to a number.
            to_number(
                --Get the first 14 bytes. This seems to be the maximum that SQL Server can handle
                --before it runs into math errors.
                substr(
                    --Hash the value.
                    standard_hash(
                        --Concatenate the values using (hopefully) unique strings to separate the
                        --columns and represent NULLs (because the hashing functions treat nulls differently.)
                        nvl(to_char(contractdatetime, 'YYYY-MM-DD HH24:MI:SS'), 'null') || 
                        '-1-' || nvl(attr3, 'null') || '-2-' || nvl(attr3, 'null') || '-3-' || nvl(attr3, 'null')
                        , 'MD5')
                    , 1, 14)
                , 'xxxxxxxxxxxxxxxxxxxx'))
        , '99999999999999999999')) hash
from table1
group by id
order by 1;

SQL Server Code

create table table1
(
    id numeric,
    contractdatetime datetime,
    Attr3 varchar(100),
    Attr4 varchar(100),
    Attr5 varchar(100)
);

insert into table1 values (1, cast('2000-01-01 00:00:00.000' as datetime), 'a', 'a', 'a');
insert into table1 values (2, cast('2000-01-01 00:00:00.000' as datetime), 'b', 'b', 'b');
insert into table1 values (2, cast('2000-01-02 00:00:00.000' as datetime), null, null, null);
insert into table1 values (3, cast('2000-01-02 00:00:00.000' as datetime), 'SQL Server', 'SQL Server', 'SQL Server');
commit;

select
    id,
    sum(
        convert(bigint, convert(varbinary, 
            substring(
                hashbytes('MD5',
                    isnull(convert(varchar(19), contractdatetime, 20), 'null') +
                    '-1-' + isnull(attr3, 'null') + '-2-' + isnull(attr3, 'null') + '-3-' + isnull(attr3, 'null'))
                , 1, 7)
            , 1))) hash
from table1
group by id
order by 1;

Results

As expected, the hashes for the first two groups are identical, and the hash for the third group is different.

Oracle:

ID  HASH
1   50696302970576522
2   69171702324546493
3   50787287321473273

SQL Server

ID  HASH
1   50696302970576522
2   69171702324546493
3   7440319042693061

Here is an Oracle fiddle and a SQL Server fiddle.

Problems

I assume this solution will only work if the databases use similar characters sets, or perhaps only use the first 127 ASCII characters that are often encoded the same in different character sets.
There is a (perhaps unreasonably) high chance of hash collisions. MD5 hashes aren't good enough for preventing cryptographic attacks, but they're good enough for comparing sets of data. The problem is that I had to use substrings to get the math to work for SQL Server. This is probably my fault for not understanding SQL Server well enough - BIGINTS should support roughly 19 digits of precision but my math only worked up to 14 digits. I probably have a conversion bug somewhere. You may need to play with the "14" and "7" numbers if you get too many collisions or overflow problems. (14 for Oracle, which counts based on the displayed hex characters. 7 for SQL Server, which counts based on the number of characters that can be represented by each hex characters, which is 0.5.)

Compare huge table(100 million rows) data between Oracle and SQL Server

Answers (1)

Oracle Code

SQL Server Code

Results

Problems

Related Questions