Pirate X
Pirate X

Reputation: 3093

JOIN with GROUP BY causing SUM() logic issues

Query -

sel TableName, DatabaseName, sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB
        from dbc.tablesize
        group by 1,2
        order by GB desc

result -

+-----------+--------+------------+
| TableName | DBName | Size_in_GB |
+-----------+--------+------------+
| WRP       | A      |  28,350.01 |
| CPC       | B      |  19,999.37 |
| SDF       | C      |  13,263.67 |
| DB1400    | D      |  13,200.26 |
+-----------+--------+------------+

From above simple query I can see that table WRP of database A is near 28350 GB

Now I am trying to join another table dbc.indices to use the column IndexType for filtering but now the Size_in_GB changes for all tables.

sel a.TableName,a.DatabaseName, sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB from dbc.tablesize a
join dbc.indices b on a.TableName = b.TableName and a.DatabaseName=b.DatabaseName
--where b.indexType='P'
group by 1,2
order by Size_in_GB desc

Result is this -

+-----------+--------+------------+
| TableName | DBName | Size_in_GB |
+-----------+--------+------------+
| WRP       | A      |  56,700.02 |
| CPC       | B      |  39,998.74 |
| DB1400    | D      |  39,600.78 |
+-----------+--------+------------+

Now the same table is twice the size i.e. WRP is 56700 GB. (Similar for other tables)

I am not sure what's wrong with the logic I'm using for join.

P.S - My aim is to find all the tables which are greater than 100GB in Size and have indexType as 'P'

EDIT - Sharing relevant columns from DBC.INDICES table

+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+
| DatabaseName | TableName  | IndexNumber | IndexType | UniqueFlag |   IndexName   | ColumnName | ColumnPosition |
+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+
| Some DB      | Some Table |           1 | P         | N          | IndexNamehere | ColumnA    |              1 |
+--------------+------------+-------------+-----------+------------+---------------+------------+----------------+

Upvotes: 0

Views: 106

Answers (4)

dnoeth
dnoeth

Reputation: 60482

dbc.IndidesV (never use the old deprecated non-V-views) has one row per column per index.

You can simply add a condition to restrict it to a single row: where IndexType = 'P' and ColumnPosition = 1

And it's more efficient to do an early aggregation, i.e. aggregate before the join:

select t.*
from 
 (
   select TableName, DatabaseName,
      sum(CurrentPerm/(1024*1024*1024)) as Size_in_GB
   from dbc.TableSizeV
   group by 1,2
   having Size_in_GB > 100
 ) as dt
join dbc.IndicesV b 
  on a.TableName = b.TableName
 and a.DatabaseName=b.DatabaseName
where IndexType = 'P' 
  and ColumnPosition = 1
order by Size_in_GB desc;

But why do you want to filter for that IndexType=P, don't you care about other objects > 100GB (NoPI/Columnar tables, Join Indices)? Btw, this doesn't return all tables with a PI, only IndexNumber=1 does.

Depending on your needs you might better join to dbc.TablesV instead.

Upvotes: 0

Thorsten Kettner
Thorsten Kettner

Reputation: 95062

P.S - My aim is to find all the tables which are greater than 100GB in Size and have indexType as 'P'

If you only want to find certain tables for which an index exists, you shouln't join at all. Use EXISTS instead. This puts your condition in the WHERE or HAVING clause where it belongs and you don't have issues with your condition duplicating your records (in your case: it doesn't matter anymore when a table has more than one matching index).

select tablename, databasename, sum(currentperm/(1024*1024*1024)) as size_in_gb 
from dbc.tablesize ts
group by tablename, databasename
having sum(currentperm/(1024*1024*1024)) > 100
and exists
(
  select *
  from dbc.indices i
  where i.tablename = ts.tablename and i.databasename = ts.databasename
  and i.indexType = 'P'
)
order by Size_in_GB desc;

Upvotes: 0

Pரதீப்
Pரதீப்

Reputation: 93754

Probably your key is duplicated in dbc.indices table. For a single TableName ,dbc.indices table has more then one entry so when you join dbc.tablesize table records got duplicated so the SUM is applied on duplicate records so the mistake in calculation.

Try this way

SELECT a.TableName,
       a.DatabaseName,
       Sum(CurrentPerm / ( 1024 * 1024 * 1024 )) AS Size_in_GB
FROM   dbc.tablesize a
       JOIN (SELECT DISTINCT b.TableName,
                             b.DatabaseName
             FROM   dbc.indices b
             --where b.indexType='P'
             ) b
         ON a.TableName = b.TableName
            AND a.DatabaseName = b.DatabaseName

GROUP  BY a.TableName,
          a.DatabaseName
ORDER  BY Size_in_GB DESC 

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270573

What is confusing?

You clearly have tables that have multiple indexes. Each index is going to result in the table appearing more than one time for the aggregation.

For what you want:

My aim is to find all the tables which are greater than 100GB in Size and have indexType as 'P'

I would suggest moving the index comparison to the where clause:

select t.TableName, t.DatabaseName,
       sum(tCurrentPerm/(1024*1024*1024)) as Size_in_GB
from dbc.tablesize t
where exists (select 1
              from dbc.indices i
              where t.TableName = i.TableName and t.DatabaseName = i.DatabaseName and
                    i.indexType = 'P'
             )
group by 1,2
order by Size_in_GB desc

You can add having Size_in_GB > 100 before the order by if you want to add that filter as well.

Upvotes: 2

Related Questions