Reputation: 11617
I have to maintain a scary legacy database that is very poorly designed. All the tables have more than 100 columns - one has 650. The database is very denormalized and I have found that often the same data is expressed in several columns in the same row.
For instance, here is a sample of columns for one of the tables:
[MEMBERADDRESS] [varchar](331) NULL,
[DISPLAYADDRESS] [varchar](max) NULL,
[MEMBERINLINEADDRESS] [varchar](max) NULL,
[DISPLAYINLINEADDRESS] [varchar](250) NULL,
[__HISTISDN] [varchar](25) NULL,
[HISTISDN] [varchar](25) NULL,
[MYDIRECTISDN] [varchar](25) NULL,
[MYISDN] [varchar](25) NULL,
[__HISTALT_PHONE] [varchar](25) NULL,
[HISTALT_PHONE] [varchar](25) NULL,
It turns out that MEMBERADDRESS
and DISPLAYADDRESS
have the same value for all rows in the table. The same is true for the other clusters of fields I have shown here.
It will be very difficult and time consuming to identify all cases like this manually. Is it possible to create a query that would identify if two fields have the same value in every row in a table?
If not, are there any existing tools that will help me identify these sorts of problems?
Upvotes: 2
Views: 263
Reputation: 1270873
The following approach uses unpivot
to create triples. It makes some assumptions: values are not null; each row has an id; and columns have compatible types.
select t.which, t2.which
from (select id, which, value
from MEMBERADDRESS
unpivot (value for which in (<list of columns here>)) up
) t full outer join
(select id, which, value
from MEMBERADDRESS
unpivot (value for which in (<list of columns here>)) up
) t2
on t.id = t2.id and t.which <> t2.which
group by t.which, t2.which
having sum(case when t.value = t2.value then 1 else 0 end) = count(*)
It works by creating a new table with three columns: id, which column, and the value in the column. It then does a self join on id (to keep comparisons within one row) and value (to get matching values). This self-join should always match, because the columns are the same in the two halves of the query.
The having
then counts the number of values that are the same on both sides for a given pair of columns. When all these are the same, then the match is successful.
You can also leave out the having
clause and use something like:
select t.which, t2.which, sum(case when t.value = t2.value then 1 else 0 end) as Nummatchs,
count(*) as NumRows
To get more complete information.
Upvotes: 1
Reputation: 726987
There are two approaches I see to simplify this query:
Here is a quick illustration of the second approach:
SELECT id, name, val FROM (
SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
UNION ALL
SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
UNION ALL
SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
UNION ALL
...
) first
JOIN (
SELECT id, MEMBERADDRESS as val,'MEMBERADDRESS' as name FROM MyTable
UNION ALL
SELECT id, DISPLAYADDRESS as val,'DISPLAYADDRESS' as name FROM MyTable
UNION ALL
SELECT id, MEMBERINLINEADDRESS as val,'MEMBERINLINEADDRESS' as name FROM MyTable
UNION ALL
...
) second ON first.id=second.id AND first.value=second.value
There is a lot of manual work for 100 columns (at least it does not grow as N^2
, as in the first approach, but it is still a lot of manual typing). You may be better off generating the selects connected with UNION ALL
using a small script.
Upvotes: 2