Reputation: 11
I have a Synapse Workspace set up with a Dedicated SQL Pool running at DW100c.
The issue is my data set is very small - I am creating some dimensions and facts, and the largest FACT is around 300,000 records. Dimensions are much smaller.
We were using Azure SQL Database with a view to migrating into Synapse. Some of the stored procedures we were running in SQL Database (2 cores) were taking around 2 minutes, whereas in Synapse they take 6-7 minutes.
I can't understand it: if a Synapse Dedicated Pool is a more scaled-up platform, why do my queries take much longer? If I scale it to DW500c it does speed up, but surely that can't be the answer? I have created the tables using round robin / replicated / hash distribution where appropriate.
Is the issue that my dataset is too small for Synapse Dedicated Pools?
Upvotes: 1
Views: 1764
Reputation: 79
Metaxim's advice is sound. Take it. But be aware it may not solve your performance issue.
When I compare across platforms I see a mixed bag of results. The pattern seems to be that analytical queries are at least as fast or faster in Synapse, while operational queries are slower.
Sounds like your issue could be related to: 1) tables not indexed correctly, 2) tables not distributed correctly, and 3) statistics needing an update. And of course, as they said, DW100c is very small. Check all of those.
A caveat on my advice is I have only done testing in Synapse. I've migrated part of our EDW to it and run test queries to compare across platforms. And I've done a lot of performance tuning on both platforms.
Note: For very small, static dimensions I use replicated tables with a clustered index. For larger dimensions and all facts I use clustered columnstore with hash distribution. Data skew shows as very low.
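For example, the table definitions look roughly like this (a minimal sketch with made-up table and column names, not your actual schema):
-- Small, static dimension: replicated with a clustered index
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (ProductKey)
);
-- Larger dimension or fact: hash-distributed clustered columnstore
-- (distribute on a high-cardinality column to keep skew low)
CREATE TABLE dbo.FactSales
(
    SalesKey    BIGINT        NOT NULL,
    ProductKey  INT           NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SalesKey),
    CLUSTERED COLUMNSTORE INDEX
);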
My conclusion at this point is Synapse isn't viable for our needs. Some queries are a lot slower than IaaS Sql while some are a lot faster. The slow ones are a show stopper for us as we serve both operational and analytical needs and can't afford to slow down anything. The reason to use Synapse is to gain significant performance benefit. Right now I don't see that is a reality.
Below are the 2 queries you need. The first flags hash-distributed tables where row counts across distributions are skewed by 10% or more (it relies on the dbo.vTableSizes helper view); the second updates statistics on every user table.
-- Hash distribution problems query
select two_part_name, distribution_policy_name, distribution_column, distribution_id, row_count
from dbo.vTableSizes
where two_part_name in
(
select two_part_name
from dbo.vTableSizes
where row_count > 0
group by two_part_name
having (max(row_count * 1.000) - min(row_count * 1.000))/max(row_count * 1.000) >= .10
)
and distribution_column is not null
order by two_part_name, row_count;
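-- Update statistics on all user tables with FULLSCAN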
DECLARE @sql NVARCHAR(MAX) = '';
SELECT @sql += 'UPDATE STATISTICS [' + s.name + '].[' + t.name + '] WITH FULLSCAN; ' + CHAR(13)
FROM sys.tables AS t
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
WHERE t.is_ms_shipped = 0; -- Excludes system tables
EXEC sp_executesql @sql;
Upvotes: 0
Reputation: 11
A Synapse dedicated SQL pool should still be fast even with just 300K records in a round-robin distribution, as long as you update statistics before you start querying, because otherwise the first query will initiate a full table scan.
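For example (a sketch against a hypothetical dbo.FactSales table; substitute your own table and column names):
-- Create a statistic on a commonly filtered/joined column, then refresh all stats on the table
CREATE STATISTICS stat_FactSales_ProductKey ON dbo.FactSales (ProductKey) WITH FULLSCAN;
UPDATE STATISTICS dbo.FactSales;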
When you do use hash distribution, remember to pick a distribution key that spreads rows fairly evenly across the 60 distributions, meaning don't pick a key that has only a few unique values and/or a large percentage of NULL values.
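To check how evenly rows are spread, you can run something like this (hypothetical table name; it returns one row per distribution with its row count and space used):
-- Rows and space used per distribution for one table
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");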
Make sure your dimension tables all use replicated distribution.
Understand that your data is spread across 60 distributions, so if your query's WHERE clause filters on a single value of the distribution key (e.g., ProductID = 90099) then only 1 of your 60 distributions will be doing any work. At DW100c that is not much processing power. Choose a different distribution key so you can take advantage of the parallel processing capabilities.
You can try setting up clustered and nonclustered indexes on your fact tables, and make sure to update statistics after any updates to those tables.
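For a fact this small, that could look something like this (a sketch with made-up names; adjust to your schema):
-- Small fact: clustered index instead of columnstore, plus a nonclustered index on a common join/filter column
CREATE TABLE dbo.FactSalesSmall
(
    SalesKey    BIGINT        NOT NULL,
    ProductKey  INT           NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX (SalesKey)
);
CREATE INDEX IX_FactSalesSmall_ProductKey ON dbo.FactSalesSmall (ProductKey);
-- Refresh statistics after loading or updating the fact
UPDATE STATISTICS dbo.FactSalesSmall;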
I normally avoid HASH distribution unless I have a table with 80M to 100M rows or more, because row groups are generated in each of the 60 distributions, and for good compression you want each row group to hold around 1M records (60 distributions x 1M rows is roughly a 60M-row floor).
Upvotes: 0