Reputation: 1042
I have a table with 40+ million rows in MS SQL Server 2019. One of the columns stores pure hexadecimal strings (both binary and readable ASCII content). I need to search this table for rows containing a specific hex string.
Normally, I would do this:
SELECT * FROM transactionoutputs WHERE outhex LIKE '%74657374%' ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
Since the results are paginated, it can take less than a second to find the first 10 results. However, when increasing the offset, or searching for strings that only appear 1-2 times in the entire table, it can take more than a minute, at which point my application will time out.
The execution plan for this query is this:
Are there any easy ways to improve the performance of such a search?
Using this answer, I was able to reduce the query time from 33 seconds to 27 seconds:
SELECT * FROM transactionoutputs WHERE
CHARINDEX('74657374' collate Latin1_General_BIN, outhex collate Latin1_General_BIN) > 0
ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
When I leave out the ORDER BY and pagination, I can reduce this to 19 seconds. This is not ideal because I need both the ordering and pagination, and it still has to scan the entire table.
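For clarity, the 19-second variant without ordering and pagination is simply:

SELECT * FROM transactionoutputs
WHERE CHARINDEX('74657374' COLLATE Latin1_General_BIN, outhex COLLATE Latin1_General_BIN) > 0;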
I have tried the following:
I came across this article about slow queries. Initially, I was using parameterized queries in my application, which was much slower than running them in SSMS. I have since moved to the query shown above, but it is still slow.
I tried to enable Multiple Active Result Sets (MARS), but without any improvement in query time.
I also tried using Full-Text Search. This seemed to be the most promising solution, as text search is exactly what I need. I created a full-text index and can run a query similar to the one above, but using the index:
SELECT * FROM transactionoutputs WHERE CONTAINS(outhex, '7465') ORDER BY id DESC OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY;
This returns results almost instantly. However, when the search string is longer than a few characters (often 4), it doesn't return anything. Am I doing something wrong, or why does it behave that way?
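For reference, I created the full-text index roughly like this (the catalog name and key index name here are placeholders for my actual ones):

CREATE FULLTEXT CATALOG ft_catalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.transactionoutputs (outhex)
    KEY INDEX PK_transactionoutputs -- placeholder: the table's unique key index
    ON ft_catalog;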
The execution plan:
My understanding is that my case is not the ideal use case for FTS, as it is designed to search readable text, not hexadecimal strings. Is it possible to use it anyway, and if so, how?
After reading dozens of articles and SO posts, I cannot confidently say I know how to improve the performance of such queries for my specific use case, if it is even possible at all. So, is there any easy option to improve this?
Upvotes: 2
Views: 1111
Reputation: 7928
First, kudos for the fantastic explanation of your problem. This helps you get better answers fast. You should also include DDL, including indexes, when possible. This will become clear as I answer your question.
I'm going to tackle a couple of issues with your query which are unrelated to how you're parsing your text now, and talk about how to handle the string problem later tonight.
Answer Part 1: Unrelated to string parsing
It's quite possible that the way you are searching through the string is not the main performance problem. Let's start with the SELECT *: do you absolutely need all the columns? Specifically, do you absolutely need all the columns that are included in that Key Lookup? This is the most important thing to sort out. Let me explain.
Your query performs a scan against your nonclustered index named outhex-index, then a Key Lookup to retrieve the columns not included in outhex-index. Key Lookups destroy performance, especially against a 40,000,000-row table.
If you do need those columns, then you should consider adding them as included columns to your outhex-index index. I say consider because I don't know how many columns there are, nor their data types. Included columns speed up queries by eliminating costly Key Lookups, but they slow down data modification, sometimes dramatically depending on the number/type of indexes. If you need the columns not included in outhex-index and they are big columns (MAX/BLOB/LOB data types, XML, etc.), then a covering index is NOT an option. If you don't need them, then you should refactor your SELECT * statement to only include the columns you need.
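For illustration, assuming the needed columns are small and named colA and colB (both hypothetical), rebuilding outhex-index as a covering index might look like this:

-- colA/colB are placeholders for the columns your SELECT actually uses
CREATE NONCLUSTERED INDEX [outhex-index]
ON dbo.transactionoutputs (outhex)
INCLUDE (colA, colB)
WITH (DROP_EXISTING = ON); -- rebuilds the existing index with the included columns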
Full text indexing is not an option here unless you can find a way to lose that sort. Sorting has N log N complexity, which means the sort gets more expensive the more rows you sort. A 40-million-row sort should be avoided whenever possible. This will be hard to avoid with full text indexing, for reasons which require more time to explain than I have right now. Adding/modifying a 40-million-row index can be expensive and take a lot of time. If you do go that route, I suggest taking an offline copy of that table to time how long it takes to build. You can also consider creating a filtered index, if possible, to reduce your search area (sketched below).
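A minimal sketch of the filtered-index idea, assuming a hypothetical column you can filter on (the spent flag here is purely illustrative):

CREATE NONCLUSTERED INDEX ix_outhex_filtered
ON dbo.transactionoutputs (outhex)
WHERE spent = 0; -- 'spent' is a placeholder; filter on whatever predicate shrinks the search area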
I noticed, too, that both queries are getting serial execution plans. I don't know if a parallel plan will help the first query with the Key Lookup, but I know that it will likely help the second one, as there is a sort involved. Parallel execution plans can really speed up sorts. Consider testing your query with OPTION (QUERYTRACEON 8649) or make_parallel() by Adam Machanic, as shown below.
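For example, a quick test against your second query (QUERYTRACEON 8649 is undocumented and typically needs elevated permissions, so treat this as a diagnostic, not a production fix):

SELECT * FROM transactionoutputs
WHERE CHARINDEX('74657374' COLLATE Latin1_General_BIN, outhex COLLATE Latin1_General_BIN) > 0
ORDER BY id DESC
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
OPTION (QUERYTRACEON 8649); -- encourages the optimizer to pick a parallel plan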
I'll update this post tonight with some ideas to parse your string faster. One thing you can look into in the meantime is Paul White's clever Trigram Wildcard String Search trick, which might be an option.
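To give you the rough shape of that trigram idea (my own simplified sketch, not Paul White's actual code; all names here are illustrative): precompute every 3-character substring of outhex into a side table, seek a few trigrams of the search string to narrow the candidates, then verify the full pattern with LIKE on just those rows.

-- Hypothetical side table: one row per (trigram, id) pair
CREATE TABLE dbo.outhex_trigrams
(
    trigram char(3) NOT NULL,
    id      bigint  NOT NULL  -- points back to transactionoutputs.id
);
CREATE CLUSTERED INDEX cx_trigrams ON dbo.outhex_trigrams (trigram, id);

-- Search for '74657374': seek two of its trigrams, then confirm the full pattern
SELECT TOP (10) t.*
FROM dbo.transactionoutputs AS t
WHERE t.id IN (SELECT id FROM dbo.outhex_trigrams WHERE trigram = '746')
  AND t.id IN (SELECT id FROM dbo.outhex_trigrams WHERE trigram = '374')
  AND t.outhex LIKE '%74657374%'  -- cheap final check over a small candidate set
ORDER BY t.id DESC;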
Upvotes: 1