Reputation: 61
How can we make queries like the ones below faster in Snowflake?
SELECT ColumnA, ColumnB, ColumnC
FROM TableA
WHERE ColumnA = 'ABC123';

SELECT ColumnA, ColumnB, ColumnC
FROM TableA
WHERE ColumnA IN ('ABC123', 'ABC456', 'ABCnnn');
TableA has 500 million rows, 200 columns, and a 30 GB compressed size; ColumnA has 300 million unique values.
ColumnA cannot be used as the clustering key because the table is already clustered by another column, which is necessary. Also, search optimization is already switched on for the table.
A typical query with 15K values in the IN clause returns results in 30 seconds on an XL warehouse, whereas the legacy datastores, which are indexed on these columns, return results in 1 second.
Upvotes: 0
Views: 1368
Reputation: 463
Snowflake uses micro-partitioning as its hybrid partitioning technology; what you are describing here may be more of a data architecture / modelling limitation.
As others have commented, 15K search keys in an IN clause is a bad idea: internally the optimizer has to parse all 15,000 search values and feed them into the static pruning step, which leads to a long compilation time. We often recommend not doing that; a better solution would be to load those 15,000 values into a separate table and use it in an INNER JOIN, as sketched below. That turns the lookup into dynamic pruning, which is what a requirement to find so many matches needs.
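A minimal sketch of that approach, assuming a hypothetical staging table SEARCH_KEYS holding the 15,000 values:

-- temporary table to hold the search keys
CREATE TEMPORARY TABLE SEARCH_KEYS (ColumnA VARCHAR);
-- load the 15,000 keys here, e.g. via COPY INTO or INSERT

-- join instead of a giant IN list, so pruning happens dynamically
SELECT t.ColumnA, t.ColumnB, t.ColumnC
FROM TableA t
INNER JOIN SEARCH_KEYS k
  ON t.ColumnA = k.ColumnA;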
Snowflake is optimised for deep right join trees, meaning that if the above is part of a star-join query, performance might even be faster!
Upvotes: 0
Reputation: 26078
Wow, a join key that is a string is bad for performance. But an IN clause with 15K values is worse. 200 columns is not that ideal either.
I suspect this will not be the case, but if the query that needs speeding up only ever selects those three columns, I would make a second table with only those three columns, and then do multi-table inserts/updates when upserting to keep the two in sync (see the sketch below).
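A rough sketch of that, assuming a hypothetical narrow copy TableA_Narrow clustered on the lookup key and a hypothetical load source staging_table:

-- narrow copy holding only the three queried columns
CREATE TABLE TableA_Narrow (ColumnA VARCHAR, ColumnB VARCHAR, ColumnC VARCHAR)
CLUSTER BY (ColumnA);

-- on each load, write to both tables in one multi-table insert
INSERT ALL
  INTO TableA  -- takes the full select row; TableA's 200-column list abbreviated here
  INTO TableA_Narrow (ColumnA, ColumnB, ColumnC) VALUES (ColumnA, ColumnB, ColumnC)
SELECT * FROM staging_table;

Point lookups would then go against TableA_Narrow, which scans far less data per micro-partition.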
With 15K values in an IN clause you are doing a full table scan, so your time will be completely IO bound. You should find you get something like ~40 MBps per instance (that is the number I recall our big IO-bound tasks used to hit), so reading the data faster will need larger instances. But that leaves you with a constant cost.
To avoid a full table scan you somehow need to let Snowflake skip some data (if you are selecting 3 of 200 columns, that is some reduction). This is either done via clustering on your key, or by limiting the data by some other criteria like insertion time, or some other filter that can drop rows, as in the sketch below.
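For instance, a hypothetical pre-filter on a load timestamp, assuming a column like LOAD_DATE exists and the table is clustered or naturally ordered on it:

SELECT ColumnA, ColumnB, ColumnC
FROM TableA
WHERE LOAD_DATE >= '2023-01-01'  -- prunes micro-partitions before the key filter runs
  AND ColumnA IN ('ABC123', 'ABC456', 'ABCnnn');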
If there is nothing you can pre-filter the data on, then you are in the joyous place of full table scans.
Upvotes: 1
Reputation: 9818
Assuming that you are using Enterprise Edition or above, you could try Search Optimization.
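For reference, a minimal sketch of enabling it (the column-scoped form assumes ColumnA is the equality-lookup column):

-- enable for the whole table
ALTER TABLE TableA ADD SEARCH OPTIMIZATION;
-- or scoped to equality lookups on the key column
ALTER TABLE TableA ADD SEARCH OPTIMIZATION ON EQUALITY(ColumnA);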
Upvotes: 0