Oliver
Oliver

Reputation: 11617

A query to find if any two fields in a row are equal?

I have to maintain a scary legacy database that is very poorly designed. All the tables have more than 100 columns - one has 650. The database is very denormalized and I have found that often the same data is expressed in several columns in the same row.

For instance, here is a sample of columns for one of the tables:

[MEMBERADDRESS] [varchar](331) NULL,
[DISPLAYADDRESS] [varchar](max) NULL,

[MEMBERINLINEADDRESS] [varchar](max) NULL,
[DISPLAYINLINEADDRESS] [varchar](250) NULL,

[__HISTISDN] [varchar](25) NULL,
[HISTISDN] [varchar](25) NULL,
[MYDIRECTISDN] [varchar](25) NULL,
[MYISDN] [varchar](25) NULL,

[__HISTALT_PHONE] [varchar](25) NULL,
[HISTALT_PHONE] [varchar](25) NULL,

It turns out that MEMBERADDRESS and DISPLAYADDRESS have the same value for all rows in the table. The same is true for the other clusters of fields I have shown here.

It will be very difficult and time consuming to identify all cases like this manually. Is it possible to create a query that would identify if two fields have the same value in every row in a table?

If not, are there any existing tools that will help me identify these sorts of problems?

Upvotes: 2

Views: 263

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1270873

The following approach uses unpivot to create triples. It makes some assumptions: values are not null; each row has an id; and columns have compatible types.

select t.which, t2.which 
from (select id, which, value
      from MEMBERADDRESS
      unpivot (value for which in (<list of columns here>)) up
     ) t full outer join
     (select id, which, value
      from MEMBERADDRESS
      unpivot (value for which in (<list of columns here>)) up
     ) t2
     on t.id = t2.id and t.which <> t2.which
group by t.which, t2.which
having sum(case when t.value = t2.value then 1 else 0 end) = count(*)

It works by creating a new table with three columns: id, which column, and the value in the column. It then does a self join on id (to keep comparisons within one row) and value (to get matching values). This self-join should always match, because the columns are the same in the two halves of the query.

The having then counts the number of values that are the same on both sides for a given pair of columns. When all these are the same, then the match is successful.

You can also leave out the having clause and use something like:

select t.which, t2.which, sum(case when t.value = t2.value then 1 else 0 end) as Nummatchs,
       count(*) as NumRows

To get more complete information.

Upvotes: 1

Sergey Kalinichenko
Sergey Kalinichenko

Reputation: 726987

There are two approaches I see to simplify this query:

  • Write a script that generates your queries - feed your script the name of the table and the suspected columns, and let it produce a query that checks each pair of columns for equality. This is the fastest approach to implement in a one-of situation like yours.
  • Write a query that "normalizes" your data, and search against it - self-join the query to itself, then filter out the duplicates.

Here is a quick illustration of the second approach:

SELECT id, name, val FROM (
    SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
    UNION ALL
    ...
) first
JOIN (
    SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
    UNION ALL
    SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
    UNION ALL
    ...
) second ON first.id=second.id AND first.value=second.value

There is a lot of manual work for 100 columns (at least it does not grow as N^2, as in the first approach, but it is still a lot of manual typing). You may be better off generating the selects connected with UNION ALL using a small script.

Upvotes: 2

Related Questions