Reputation: 2493
There is a snippet of production code that does a row check. It's actually migrated code that came into Teradata, and no one has bothered to change it to be TD-savvy, shall I say. This code now throws
2646 : No more spool space...
and that is not really a spool shortage but data skew, as would be evident to any Teradata master.
The code logic is plainly bad, but they are running it in Prod. A code change is NOT an option now because this is production. I can rewrite it using a simple NOT EXISTS and the query will run fine.
EXPLAIN
SELECT COALESCE(FF.SKEW_COL, -99999) AS Cnt1,
       COUNT(*) AS Cnt
FROM DB.10_BILLON_FACT FF
WHERE FF.SKEW_COL IN (
    SELECT F.SKEW_COL
    FROM DB.10_BILLON_FACT F
    EXCEPT
    SELECT D.DIM_COL
    FROM DB.Smaller_DIM D
)
GROUP BY 1
It's failing because it wants to redistribute on SKEW_COL. WHATEVER I DO, THIS WILL NOT CHANGE. SKEW_COL is 99% skewed.
Here's the explain plan. It FAILS ON STEP 4.1:
This query is optimized using type 2 profile insert-sel, profileid
10001.
1) First, we lock a distinct DB."pseudo table" for read on a
RowHash to prevent global deadlock for DB.F.
2) Next, we lock a distinct DB."pseudo table" for read on a
RowHash to prevent global deadlock for DB.D.
3) We lock DB.F for read, and we lock DB.D for read.
4) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from DB.F by way of an
all-rows scan with no residual conditions into Spool 6
(all_amps), which is redistributed by the hash code of (
DB.F.SKEW_COL) to all AMPs. Then we
do a SORT to order Spool 6 by row hash and the sort key in
spool field1 eliminating duplicate rows. The size of Spool 6
is estimated with low confidence to be 989,301 rows (
28,689,729 bytes). The estimated time for this step is 1
minute and 36 seconds.
2) We do an all-AMPs RETRIEVE step from DB.D by way of an
all-rows scan with no residual conditions into Spool 7
(all_amps), which is built locally on the AMPs. Then we do a
SORT to order Spool 7 by the hash code of (
DB.D.DIM_COL). The size of Spool 7 is
estimated with low confidence to be 6,118,545 rows (
177,437,805 bytes). The estimated time for this step is 0.11
seconds.
5) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of an
all-rows scan, which is joined to Spool 7 (Last Use) by way of an
all-rows scan. Spool 6 and Spool 7 are joined using an exclusion
merge join, with a join condition of ("Field_1 = Field_1"). The
result goes into Spool 1 (all_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with low confidence to be
494,651 rows (14,344,879 bytes). The estimated time for this step
is 3.00 seconds.
6) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 1 (Last Use) by
way of an all-rows scan into Spool 5 (all_amps), which is
redistributed by the hash code of (
DB.F.SKEW_COL) to all AMPs. Then we
do a SORT to order Spool 5 by row hash. The size of Spool 5
is estimated with low confidence to be 494,651 rows (
12,366,275 bytes). The estimated time for this step is 0.13
seconds.
2) We do an all-AMPs RETRIEVE step from DB.FF by way of an
all-rows scan with no residual conditions into Spool 8
(all_amps) fanned out into 24 hash join partitions, which is
built locally on the AMPs. The size of Spool 8 is estimated
with high confidence to be 2,603,284,805 rows (
54,668,980,905 bytes). The estimated time for this step is
24.40 seconds.
7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of
an all-rows scan into Spool 9 (all_amps) fanned out into 24 hash
join partitions, which is duplicated on all AMPs. The size of
Spool 9 is estimated with low confidence to be 249,304,104 rows (
5,235,386,184 bytes). The estimated time for this step is 1.55
seconds.
8) We do an all-AMPs JOIN step from Spool 8 (Last Use) by way of an
all-rows scan, which is joined to Spool 9 (Last Use) by way of an
all-rows scan. Spool 8 and Spool 9 are joined using a inclusion
hash join of 24 partitions, with a join condition of (
"SKEW_COL = SKEW_COL"). The
result goes into Spool 4 (all_amps), which is built locally on the
AMPs. The size of Spool 4 is estimated with index join confidence
to be 1,630,304,007 rows (37,496,992,161 bytes). The estimated
time for this step is 11.92 seconds.
9) We do an all-AMPs SUM step to aggregate from Spool 4 (Last Use) by
way of an all-rows scan , grouping by field1 (
DB.FF.SKEW_COL). Aggregate Intermediate
Results are computed globally, then placed in Spool 11. The size
of Spool 11 is estimated with low confidence to be 494,651 rows (
14,344,879 bytes). The estimated time for this step is 35.00
seconds.
10) We do an all-AMPs RETRIEVE step from Spool 11 (Last Use) by way of
an all-rows scan into Spool 2 (group_amps), which is built locally
on the AMPs. The size of Spool 2 is estimated with low confidence
to be 494,651 rows (16,323,483 bytes). The estimated time for
this step is 0.01 seconds.
11) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 2 are sent back to the user as the result of
statement 1. The total estimated time is 2 minutes and 52 seconds.
There are some 900K unique values of the skewed column (interestingly, there are 6 million unique values for DIM_COL, which is why I think it is veering towards the fact table column; but still, it knows from the low uniqueness in the bigger table that it's badly skewed). My question is: after knowing that SKEW_COL is 99% skewed due to a constant value like -9999, WHY does the optimizer still redistribute by this skewed column instead of using an alternate PRPD approach? A similar (but not the same) situation happened in the past, but it went away when we upgraded to a faster box (more AMPs).
Anything that comes to mind that will make it change plans? I tried most diagnostics with no result. I created an SI (on a similar VT, but it will still skew). SKEWING is inevitable (you can artificially change the data to minimize it, I am aware, but none of that helps after the fact; now we are in Prod and everything is over), but even after it knows the column is skewed, why redistribute on it when other options are available?
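For reference, the per-AMP skew can be confirmed with the standard Teradata hash functions; a sketch using the table and column names from the question:

```sql
-- Approximate per-AMP row distribution if rows were redistributed on SKEW_COL.
-- One AMP holding ~99% of the rows confirms the redistribution in step 4.1 will skew.
SELECT HASHAMP(HASHBUCKET(HASHROW(SKEW_COL))) AS target_amp,
       COUNT(*) AS row_cnt
FROM DB.10_BILLON_FACT
GROUP BY 1
ORDER BY row_cnt DESC;
```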
It's not the NULL value that's skewing; it's a constant flag value (probably a value representing NULL, like -9999) that is causing the skew, as I mentioned in the post. If you rewrite the query as I updated it, it works fine. I preferred NOT EXISTS because NOT IN needs NULL checking (though as a practice; from my DD knowledge I know both columns are declared NOT NULL). I have updated the post with an alternative version that will work (though, as I explained, I finalized on the NOT EXISTS version):
SELECT COUNT(*), f.SKEW_COL
FROM (
    SELECT ff.SKEW_COL
    FROM DB.10_BILLON_FACT ff
    WHERE ff.SKEW_COL NOT IN (
        SELECT d.DIM_COL
        FROM DB.Smaller_DIM d )
) AS f
GROUP BY f.SKEW_COL
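For completeness, a sketch of the NOT EXISTS form I finalized on (equivalent here, since both columns are declared NOT NULL; same tables and columns as above):

```sql
-- NOT EXISTS rewrite: avoids the NULL semantics of NOT IN and lets the
-- optimizer pick a plan that does not redistribute the fact table on SKEW_COL.
SELECT ff.SKEW_COL,
       COUNT(*) AS Cnt
FROM DB.10_BILLON_FACT ff
WHERE NOT EXISTS (
    SELECT 1
    FROM DB.Smaller_DIM d
    WHERE d.DIM_COL = ff.SKEW_COL )
GROUP BY ff.SKEW_COL;
```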
Can I not get the optimizer's query-rewrite feature to think through the query and rewrite it with the above logic? The above will NOT redistribute, but just SORT by the skewed column.
Upvotes: 0
Views: 1097
Reputation: 7786
Until you can replace the SQL, adding spool may be your only option.
Make sure your stats are current, or consider a join index with an alternative PI that covers this particular query without having to do the redistribution. The JI may itself be skewed, but if the work can be done AMP-local you may be able to address the spool issue.
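For example, a single-table join index along these lines might let the optimizer satisfy the subquery AMP-locally (the index name is illustrative; the JI itself will be skewed on SKEW_COL, but no spool redistribution of the 10-billion-row fact is needed):

```sql
-- Hypothetical covering single-table join index on the fact table.
CREATE JOIN INDEX DB.JI_FACT_SKEW AS
SELECT SKEW_COL
FROM DB.10_BILLON_FACT
PRIMARY INDEX (SKEW_COL);
```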
Upvotes: 1