Reputation: 43
My organisation is currently evaluating Azure Datawarehouse. We have a fact table that contains 16M rows and another that has 5M both of them are hash distributed on the same column (same data type and length).
When an inner join is performed at 200 DWU scale with smallrc resource class the query takes nearly 6 minutes, yet the DWU usage (as shown in portal) is only fraction of all available DWU's. Also, when the same query is initiated multiple times at the same time (through SSIS execute SQL tasks) the datawarehouse always executes 6 instances (maximum).
We have changed the DWU setting from 200 to 500 to 1000 and finally to 2000 but each time the maximum number of concurrent runs of the same query is limited to 6 only and the DWU usage always remains fraction of the total available, there is no noticeable change in performance either.
Is this expected behaviour? surely greatly increasing the DWU's should have provided massive decrease in query execution time? I appreciate that this is very generic but I am trying to understand if I have not considered something which is important for such an analysis?
Here is the Explain:
explain
<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="10" number_distributions="60" number_distributions_per_node="6">
<sql>select shifts.[Shift Reference], Shifts.[Trust Code], Shifts.[Ward Code], shifts.[Location Code], Qualification, YYYYMM, booking_removal.[Booking Type], booking_removal.[Booking Type], count(distinct booking_removal.[Booking ID]) from shifts join booking_removal on shifts.[Shift reference] = booking_removal.[Shift Reference] join dimdate on shifts.[Shift Start Date] = date where year in (2015, 2016)
group by shifts.[Shift Reference], Shifts.[Trust Code], Shifts.[Ward Code], shifts.[Location Code], Qualification, YYYYMM, booking_removal.[Booking Type], booking_removal.[Booking Type]</sql>
<dsql_operations total_cost="1.22805816" total_number_operations="5">
<dsql_operation operation_type="RND_ID">
<identifier>TEMP_ID_15134</identifier>
</dsql_operation>
<dsql_operation operation_type="ON">
<location permanent="false" distribution="AllComputeNodes" />
<sql_operations>
<sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_15134] ([Date] DATE, [YYYYMM] INT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
</sql_operations>
</dsql_operation>
<dsql_operation operation_type="BROADCAST_MOVE">
<operation_cost cost="1.22805816" accumulative_cost="1.22805816" average_rowsize="7" output_rows="730.987" GroupNumber="15" />
<source_statement>SELECT [T1_1].[Date] AS [Date],
[T1_1].[YYYYMM] AS [YYYYMM]
FROM (SELECT [T2_1].[Date] AS [Date],
[T2_1].[YYYYMM] AS [YYYYMM]
FROM [NHSP-Shifts-DW].[dbo].[DimDate] AS T2_1
WHERE (([T2_1].[Year] = CAST ((2015) AS INT))
OR ([T2_1].[Year] = CAST ((2016) AS INT)))) AS T1_1</source_statement>
<destination_table>[TEMP_ID_15134]</destination_table>
</dsql_operation>
<dsql_operation operation_type="RETURN">
<location distribution="AllDistributions" />
<select>SELECT [T1_1].[Shift Reference] AS [Shift Reference],
[T1_1].[Trust Code] AS [Trust Code],
[T1_1].[Ward Code] AS [Ward Code],
[T1_1].[Location Code] AS [Location Code],
[T1_1].[Qualification] AS [Qualification],
[T1_1].[YYYYMM] AS [YYYYMM],
[T1_1].[Booking Type] AS [Booking Type],
[T1_1].[Booking Type] AS [Booking Type1],
[T1_1].[col] AS [col]
FROM (SELECT COUNT([T2_1].[Booking ID]) AS [col],
[T2_1].[Shift Reference] AS [Shift Reference],
[T2_1].[Trust Code] AS [Trust Code],
[T2_1].[Ward Code] AS [Ward Code],
[T2_1].[Location Code] AS [Location Code],
[T2_1].[Qualification] AS [Qualification],
[T2_1].[YYYYMM] AS [YYYYMM],
[T2_1].[Booking Type] AS [Booking Type]
FROM (SELECT [T3_1].[Booking ID] AS [Booking ID],
[T3_2].[Shift Reference] AS [Shift Reference],
[T3_2].[Trust Code] AS [Trust Code],
[T3_2].[Ward Code] AS [Ward Code],
[T3_2].[Location Code] AS [Location Code],
[T3_2].[Qualification] AS [Qualification],
[T3_2].[YYYYMM] AS [YYYYMM],
[T3_1].[Booking Type] AS [Booking Type]
FROM [NHSP-Shifts-DW].[dbo].[Booking_Removal] AS T3_1
INNER JOIN
(SELECT [T4_2].[Shift Reference] AS [Shift Reference],
[T4_2].[Trust Code] AS [Trust Code],
[T4_2].[Ward Code] AS [Ward Code],
[T4_2].[Location Code] AS [Location Code],
[T4_2].[Qualification] AS [Qualification],
[T4_1].[YYYYMM] AS [YYYYMM]
FROM [tempdb].[dbo].[TEMP_ID_15134] AS T4_1
INNER JOIN
[NHSP-Shifts-DW].[dbo].[Shifts] AS T4_2
ON ([T4_1].[Date] = [T4_2].[Shift Start Date])) AS T3_2 ON ([T3_1].[Shift Reference] = [T3_2].[Shift Reference])
GROUP BY [T3_2].[Shift Reference], [T3_2].[Trust Code], [T3_2].[Ward Code], [T3_2].[Location Code], [T3_2].[Qualification], [T3_2].[YYYYMM], [T3_1].[Booking Type], [T3_1].[Booking ID]) AS T2_1
GROUP BY [T2_1].[Shift Reference], [T2_1].[Trust Code], [T2_1].[Ward Code], [T2_1].[Location Code], [T2_1].[Qualification], [T2_1].[YYYYMM], [T2_1].[Booking Type]) AS T1_1</select>
</dsql_operation>
<dsql_operation operation_type="ON">
<location permanent="false" distribution="AllComputeNodes" />
<sql_operations>
<sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_15134]</sql_operation>
</sql_operations>
</dsql_operation>
</dsql_operations>
</dsql_query>
When distributing the table I did check for data skew and the [shift reference] column almost provides none. Here are the results:
188648 123888 55888 8872 1 1
189096 123816 55960 9320 1 10
189352 123760 56000 9592 1 11
189488 124000 55960 9528 1 12
189096 123640 55984 9472 2 13
189544 123848 55952 9744 2 14
189024 123656 55952 9416 2 15
188960 123560 55984 9416 2 16
188256 123416 55944 8896 2 17
188840 123544 56000 9296 2 18
189096 123680 55960 9456 2 19
188840 123792 55936 9112 1 2
189288 123672 56000 9616 2 20
188648 123744 55976 8928 2 21
188840 123840 55976 9024 2 22
188776 123696 56032 9048 2 23
188392 123384 55944 9064 2 24
188384 123656 55976 8752 3 25
189096 123992 55968 9136 3 26
189032 123960 55968 9104 3 27
188968 123896 55960 9112 3 28
188776 123560 55976 9240 3 29
188840 123576 55960 9304 1 3
189024 123576 56000 9448 3 30
188768 123448 55920 9400 3 31
188896 123584 55992 9320 3 32
188960 123792 55984 9184 3 33
188456 123280 55992 9184 3 34
189736 124192 55976 9568 3 35
189288 123864 56024 9400 3 36
189168 123624 55976 9568 4 37
189096 123704 56016 9376 4 38
188968 123520 55992 9456 4 39
189672 123816 55952 9904 1 4
188904 123704 55944 9256 4 40
188712 123600 55976 9136 4 41
189160 123688 55976 9496 4 42
189416 124008 55976 9432 4 43
189032 123816 55952 9264 4 44
188968 123680 55984 9304 4 45
189352 123816 55992 9544 4 46
189416 124056 56008 9352 4 47
189608 123984 55992 9632 4 48
189160 123712 55968 9480 5 49
189224 123936 55928 9360 1 5
189480 123880 55968 9632 5 50
189480 123808 56032 9640 5 51
188840 123528 55968 9344 5 52
188960 123632 55936 9392 5 53
189024 123752 55952 9320 5 54
189216 123768 55976 9472 5 55
189600 123928 55960 9712 5 56
188456 123560 55912 8984 5 57
189664 124160 55976 9528 5 58
189096 123824 55976 9296 5 59
189288 123896 55960 9432 1 6
189096 123696 55936 9464 5 60
189224 123664 55952 9608 1 7
188776 123864 55920 8992 1 8
188712 123752 55920 9040 1 9
Upvotes: 1
Views: 897
Reputation: 506
How many rows is your query returning? If it's returning a lot of rows, the return operation on the client may be the bottleneck and thus scaling DWUs may not help. You mentioned that the currency is always 6. Could it be that this is an SSIS setting? In SQL DW the concurrency varies depending on the amount of DWUs allocated. This is explained in detail in the Concurrency and workload management in SQL Data Warehouse article. To see how many queries are actually running at once, you can run the query.
SELECT * FROM sys.dm_pdw_exec_requests WHERE status = 'Running';
Upvotes: 1
Reputation: 15849
Your effort per distribution is mostly the same, which is good. Having extra nodes involved can increase the processing power, but the effort in handling a single distribution isn't going to be impacted much. If you were CPU-bound, then sure. If you were IO-bound, then sure. But your query is taking a long time because of the grouping you're doing, and so on. You could improve the query in various ways, but increasing the DWU won't help much, except to allow more processing power for other queries. If you want to tune your query to make it run faster in general, then that's a different kind of question.
Upvotes: 1