Reputation: 2493
There is a snippet of production code that does a row check. It's actually migrated code that came into Teradata, and no one has bothered to change it to be TD-savvy, shall I say. This code now throws
2646 : No more spool space...
and that is not really a spool shortage but data skew, as would be evident to any Teradata master.
The code logic is plainly bad, but they are running it in Prod. A code change is NOT an option now because this is production. I can rewrite it using a simple NOT EXISTS and the query will run fine.
EXPLAIN
SELECT COALESCE(FF.SKEW_COL, -99999) AS Cnt1,
       COUNT(*) AS Cnt
FROM DB.10_BILLON_FACT FF
WHERE FF.SKEW_COL IN (
    SELECT F.SKEW_COL
    FROM DB.10_BILLON_FACT F
    EXCEPT
    SELECT D.DIM_COL
    FROM DB.Smaller_DIM D
)
GROUP BY 1
It's failing because it wants to redistribute on SKEW_COL. WHATEVER I DO, THIS WILL NOT CHANGE. SKEW_COL is 99% skewed.
Here's the explain plan. It FAILS ON STEP 4.1:
This query is optimized using type 2 profile insert-sel, profileid
10001.
1) First, we lock a distinct DB."pseudo table" for read on a
RowHash to prevent global deadlock for DB.F.
2) Next, we lock a distinct DB."pseudo table" for read on a
RowHash to prevent global deadlock for DB.D.
3) We lock DB.F for read, and we lock DB.D for read.
4) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from DB.F by way of an
all-rows scan with no residual conditions into Spool 6
(all_amps), which is redistributed by the hash code of (
DB.F.SKEW_COL) to all AMPs. Then we
do a SORT to order Spool 6 by row hash and the sort key in
spool field1 eliminating duplicate rows. The size of Spool 6
is estimated with low confidence to be 989,301 rows (
28,689,729 bytes). The estimated time for this step is 1
minute and 36 seconds.
2) We do an all-AMPs RETRIEVE step from DB.D by way of an
all-rows scan with no residual conditions into Spool 7
(all_amps), which is built locally on the AMPs. Then we do a
SORT to order Spool 7 by the hash code of (
DB.D.DIM_COL). The size of Spool 7 is
estimated with low confidence to be 6,118,545 rows (
177,437,805 bytes). The estimated time for this step is 0.11
seconds.
5) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of an
all-rows scan, which is joined to Spool 7 (Last Use) by way of an
all-rows scan. Spool 6 and Spool 7 are joined using an exclusion
merge join, with a join condition of ("Field_1 = Field_1"). The
result goes into Spool 1 (all_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with low confidence to be
494,651 rows (14,344,879 bytes). The estimated time for this step
is 3.00 seconds.
6) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from Spool 1 (Last Use) by
way of an all-rows scan into Spool 5 (all_amps), which is
redistributed by the hash code of (
DB.F.SKEW_COL) to all AMPs. Then we
do a SORT to order Spool 5 by row hash. The size of Spool 5
is estimated with low confidence to be 494,651 rows (
12,366,275 bytes). The estimated time for this step is 0.13
seconds.
2) We do an all-AMPs RETRIEVE step from DB.FF by way of an
all-rows scan with no residual conditions into Spool 8
(all_amps) fanned out into 24 hash join partitions, which is
built locally on the AMPs. The size of Spool 8 is estimated
with high confidence to be 2,603,284,805 rows (
54,668,980,905 bytes). The estimated time for this step is
24.40 seconds.
7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of
an all-rows scan into Spool 9 (all_amps) fanned out into 24 hash
join partitions, which is duplicated on all AMPs. The size of
Spool 9 is estimated with low confidence to be 249,304,104 rows (
5,235,386,184 bytes). The estimated time for this step is 1.55
seconds.
8) We do an all-AMPs JOIN step from Spool 8 (Last Use) by way of an
all-rows scan, which is joined to Spool 9 (Last Use) by way of an
all-rows scan. Spool 8 and Spool 9 are joined using a inclusion
hash join of 24 partitions, with a join condition of (
"SKEW_COL = SKEW_COL"). The
result goes into Spool 4 (all_amps), which is built locally on the
AMPs. The size of Spool 4 is estimated with index join confidence
to be 1,630,304,007 rows (37,496,992,161 bytes). The estimated
time for this step is 11.92 seconds.
9) We do an all-AMPs SUM step to aggregate from Spool 4 (Last Use) by
way of an all-rows scan , grouping by field1 (
DB.FF.SKEW_COL). Aggregate Intermediate
Results are computed globally, then placed in Spool 11. The size
of Spool 11 is estimated with low confidence to be 494,651 rows (
14,344,879 bytes). The estimated time for this step is 35.00
seconds.
10) We do an all-AMPs RETRIEVE step from Spool 11 (Last Use) by way of
an all-rows scan into Spool 2 (group_amps), which is built locally
on the AMPs. The size of Spool 2 is estimated with low confidence
to be 494,651 rows (16,323,483 bytes). The estimated time for
this step is 0.01 seconds.
11) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 2 are sent back to the user as the result of
statement 1. The total estimated time is 2 minutes and 52 seconds.
There are some 900K unique values of the skewed column (interestingly, there are 6 million unique values for DIM_COL, which is why I think it is veering towards the fact table column; but still, it knows from the low uniqueness in the bigger table that it's badly skewed). My question is: after knowing that SKEW_COL is 99% skewed due to a constant value like -9999, WHY does the optimizer still redistribute by this skewed column instead of using an alternate PRPD approach? A similar (but not the same) situation happened in the past, but it went away when we upgraded to a faster box (more AMPs).
Anything that comes to mind that will make it change plans? I tried most diagnostics with no result. I created an SI (on a similar VT, but it will still skew). SKEWING is inevitable (you can artificially change the data to minimize it, I am aware, but none of that helps after the fact; now we are in Prod and everything is over), but even after it knows the column is skewed, why redistribute on it when other options are available?
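For reference, the per-AMP skew can be confirmed with the standard Teradata hash functions; a sketch using the table and column names from the question:

```sql
-- Approximate per-AMP row distribution if rows were redistributed on SKEW_COL.
-- One AMP holding ~99% of the rows confirms the redistribution in step 4.1 will skew.
SELECT HASHAMP(HASHBUCKET(HASHROW(SKEW_COL))) AS target_amp,
       COUNT(*) AS row_cnt
FROM DB.10_BILLON_FACT
GROUP BY 1
ORDER BY row_cnt DESC;
```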
It's not the NULL value that's skewing; it's a constant flag value (probably a value representing NULL, like -9999) that is causing the skew, as I mentioned in the post. If you rewrite the query as I updated it, it works fine. I preferred NOT EXISTS because NOT IN needs NULL checking (though as a practice; from my DD knowledge I know both columns are declared NOT NULL). I have updated the post with an alternative version that will work (though, as I explained, I finalized on the NOT EXISTS version):
SELECT COUNT(*), f.SKEW_COL
FROM (
    SELECT ff.SKEW_COL
    FROM DB.10_BILLON_FACT ff
    WHERE ff.SKEW_COL NOT IN (
        SELECT d.DIM_COL
        FROM DB.Smaller_DIM d )
) AS f
GROUP BY f.SKEW_COL
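For completeness, a sketch of the NOT EXISTS form I finalized on (equivalent here, since both columns are declared NOT NULL; same tables and columns as above):

```sql
-- NOT EXISTS rewrite: avoids the NULL semantics of NOT IN and lets the
-- optimizer pick a plan that does not redistribute the fact table on SKEW_COL.
SELECT ff.SKEW_COL,
       COUNT(*) AS Cnt
FROM DB.10_BILLON_FACT ff
WHERE NOT EXISTS (
    SELECT 1
    FROM DB.Smaller_DIM d
    WHERE d.DIM_COL = ff.SKEW_COL )
GROUP BY ff.SKEW_COL;
```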
Can I not get the optimizer's query-rewrite feature to think through the query and rewrite it with the above logic? The above will NOT redistribute, but just SORT by the skewed column.
Upvotes: 0
Views: 1097
Reputation: 7786
Until you can replace the SQL, adding spool may be your only option.
Make sure your stats are current, or consider a join index with an alternative PI that covers this particular query without having to do the redistribution. The JI may itself be skewed, but if the work can be done AMP-local you may be able to address the spool issue.
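For example, a single-table join index along these lines might let the optimizer satisfy the subquery AMP-locally (the index name is illustrative; the JI itself will be skewed on SKEW_COL, but no spool redistribution of the 10-billion-row fact is needed):

```sql
-- Hypothetical covering single-table join index on the fact table.
CREATE JOIN INDEX DB.JI_FACT_SKEW AS
SELECT SKEW_COL
FROM DB.10_BILLON_FACT
PRIMARY INDEX (SKEW_COL);
```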
Upvotes: 1