Jamsandwich

Reputation: 634

Oracle - Select Only Columns That Contain Data

We have a database with a vast number of tables and columns that was set up by a 3rd party.

Many of these columns are entirely unused. I am trying to create a query that returns a list of all the columns that are actually used (contain > 0 values).

My current attempt:

SELECT table_name, column_name
FROM ALL_TAB_COLUMNS
WHERE OWNER = 'XUSER'
  AND num_nulls < 1;

Using num_nulls < 1 dramatically reduces the number of returned values, as expected.

However, on inspection of some of the tables, there are columns missing from the results of the query that appear to have values in them.

Could anybody explain why this might be the case?

Upvotes: 1

Views: 2097

Answers (2)

APC

Reputation: 146269

"there are columns missing from the results of the query that appear to have values in them."

Potentially every non-mandatory column could appear in this set, because it is likely that some rows will have values but not all rows. For such columns num_nulls is greater than zero, so they won't pass your test of num_nulls < 1.

So maybe you should search for columns which aren't in use. This query will find columns where every row is null:

select t.table_name
     , tc.column_name
from user_tables t
     join user_tab_cols tc on t.table_name = tc.table_name
where t.num_rows > 0
  and t.num_rows = tc.num_nulls;

Note that if you are using Partitioning you will need to scan user_tab_partitions.num_rows and user_part_col_statistics.num_nulls.
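Something along these lines should work for the partition-level check (a sketch only, assuming partition-level statistics have been gathered; it flags a column as unused only when its null count equals the row count across every non-empty partition):

-- Partition-aware variant of the check above, using the
-- USER_TAB_PARTITIONS and USER_PART_COL_STATISTICS views.
select tp.table_name
     , cs.column_name
from user_tab_partitions tp
     join user_part_col_statistics cs
          on  cs.table_name     = tp.table_name
          and cs.partition_name = tp.partition_name
where tp.num_rows > 0
group by tp.table_name, cs.column_name
having sum(tp.num_rows) = sum(cs.num_nulls);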

Also, I second the advice others have given regarding statistics. The above query may throw out some false positives. I would treat the results generated from that query as a list of candidates to be investigated further. For instance you could generate queries which counted the actual number of nulls for each column.
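For example, something like this would generate a verification query for each candidate column (a sketch; run the generated statements afterwards to get the true non-null counts):

-- Generates one COUNT query per candidate column.
-- count(column) counts only non-null values, so a result of 0
-- confirms the column really is unused.
select 'select ''' || t.table_name || '.' || tc.column_name || ''' as col, '
       || 'count(' || tc.column_name || ') as non_null_rows '
       || 'from ' || t.table_name || ';' as verification_sql
from user_tables t
     join user_tab_cols tc on t.table_name = tc.table_name
where t.num_rows > 0
  and t.num_rows = tc.num_nulls;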

Upvotes: 0

Matthew McPeak

Reputation: 17944

First of all, statistics are not always 100% accurate. They can be gathered on a subset of the table rows, since they are, after all, statistics. Just like pollsters do not have to ask every American how they feel about a given politician, Oracle can get an accurate-enough sense of the data in a table by reading only a portion of it.

Even if the statistics were gathered on 100% of the rows in a table (and they can be gathered that way, if you want), the statistics will become outdated as soon as there are any inserts, updates, or deletes on the table.
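For example, something like the following would refresh the statistics on a single table with a 100% sample before you rely on them (SOME_TABLE is a placeholder table name; DBMS_STATS.GATHER_TABLE_STATS is the standard procedure for this):

-- Gather statistics on one table using a full (non-sampled) scan so that
-- NUM_ROWS and NUM_NULLS reflect the current data.
begin
  dbms_stats.gather_table_stats(
    ownname          => 'XUSER',
    tabname          => 'SOME_TABLE',  -- placeholder table name
    estimate_percent => 100
  );
end;
/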

Second of all, num_nulls < 1 isn't the right test for finding the columns that contain data. Imagine a table with 100 rows and "column X" having num_nulls equal to 80. That would imply the column has 20 non-null values, but it would NOT pass your filter. A better approach (if you trust that your statistics are not stale and are based on a 100% sample of the rows) might be to compare DBA_TAB_COLUMNS.NUM_NULLS < DBA_TABLES.NUM_ROWS. For example, a column that has 99 nulls in a 100-row table has data in 1 row.
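One way to write that comparison (a sketch, assuming you can query the DBA_ views; the ALL_ or USER_ equivalents work the same way):

-- Columns whose statistics show at least one non-null value:
-- num_nulls < num_rows means the column has data in at least one row.
select c.owner, c.table_name, c.column_name
from dba_tab_columns c
     join dba_tables t
          on  t.owner      = c.owner
          and t.table_name = c.table_name
where c.owner = 'XUSER'
  and t.num_rows > 0
  and c.num_nulls < t.num_rows;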

Upvotes: 1
