How to find duplicates from multiple tables at once in Teradata?

Question

There are 40 tables in the Data warehouse. Every day after the daily load I would check after there are any data issues in the tables. This is achieved using select queries to find the duplicates.

SELECT COALESCE (SUM(DUPS),0) AS DUPS_COUNT, 'PLAYER' AS TABLENAME FROM (Select Count(1) AS DUPS from DW.PLAYER group by PLAYERID having count(1) > 1) A
UNION
SELECT COALESCE (SUM(DUPS),0) AS DUPS_COUNT, 'PlayerBalance' AS TABLENAME FROM (Select Count(1) AS DUPS from DW.PlayerBalance group by PlayerID,SiteID having count(1) > 1) B
UNION
.
.
.
.
.
UNION
SELECT COALESCE (SUM(DUPS),0) AS DUPS_COUNT, 'TABLE40' AS TABLENAME FROM (Select Count(1) AS DUPS from DW.TABLE40 group by PLAYERID having count(1) > 1) AK

Sample output for one table:

I have the select query to find the duplicates for each of the 40 tables. All the individual 40 select statements are correct logically and syntactically. But rather than running one SQL select at a time I have created a common format for each table and did a UNION for all the 40 select queries.

When this query containing the 40 select queries combined with an UNION is being run then I see the output is being shown only for 16 tables instead of the 40 tables.

How can this be fixed so that all the 40 tables can be searched for duplicates in one go?

Gordon Linoff · Accepted Answer

I am going to expand on my comment.

Each of your subqueries is going to return exactly one row, regardless of the number of duplicates in each table. The first column will be 0 if there are no dups.

Hence, 40 queries should be running returning two columns. How could rows be missing? The obvious culprit would be UNION, because UNION removes duplicates. For the examples in the question, the second column is unique. However, if there are any duplicates names, then the counts might be the same in the first column . . . and voila! Rows would be missing due to the duplicate elimination using UNION.

I would recommend replacing UNION with UNION ALL. Even if this were not an issue, there is no reason to incur overhead for removing duplicates. In generally, you should always use UNION ALL, unless you want to use UNION -- and that is most of the time. Think of how much more often you use SELECT instead of SELECT DISTINCT.

How to find duplicates from multiple tables at once in Teradata?

Answers (2)

Related Questions