Reputation: 545

Best SQL query for list of records containing certain characters?

I'm working with a relatively large SQL Server 2000 DB at the moment. It's 80 GB in size, and have millions and millions of records.

I currently need to return a list of names that contains at least one of a series of illegal characters. By illegal characters is just meant an arbitrary list of characters that is defined by the customer. In the below example I use question mark, semi-colon, period and comma as the illegal character list.

I was initially thinking to do a CLR function that worked with regular expressions, but as it's SQL server 2000, I guess that's out of the question.

At the moment I've done like this:

select x from users
where 
columnToBeSearched like '%?%' OR
columnToBeSearched like '%;%' OR
columnToBeSearched like '%.%' OR
columnToBeSearched like '%,%' OR
otherColumnToBeSearched like '%?%' OR
otherColumnToBeSearched like '%;%' OR
otherColumnToBeSearched like '%.%' OR
otherColumnToBeSearched like '%,%'

Now, I'm not a SQL expert by any means, but I get the feeling that the above query will be very inefficient. Doing 8 multiple wildcard searches in a table with millions of records, seems like it could slow the system down rather seriously. While it seems to work fine on test servers, I am getting the "this has to be completely wrong" vibe.

As I need to execute this script on a live production server eventually, I hope to achieve good performance, so as not to clog the system. The script might need to be expanded later on to include more illegal characters, but this is very unlikely.

To sum up: My aim is to get a list of records where either of two columns contain a customer-defined "illegal character". The database is live and massive, so I want a somewhat efficient approach, as I believe the above queries will be very slow.

Can anyone tell me the best way for achieving my result? Thanks!

/Morten

Upvotes: 2

Answers (4)

Jon Egerton

Reputation: 41559

It doesn't get used much, but the LIKE statement accepts patterns in a similar (but much simplified) way to Regex. This link is the msdn page for it.

In your case you could simplify to (untested):

select x from users
where 
    columnToBeSearched like '%[?;.,]%' OR
    otherColumnToBeSearched like '%[?;.,]%'

Also note that you can create the LIKE pattern as a variable, allowing for the customer defined part of your requirements.

One other major optimization: If you've got an updated date (or timestamp) on the user row (for any audit history type of thing), then you can always just query rows updated since the last time you checked.

Upvotes: 6

Andrew

Reputation: 336

Look up PATINDEX it allows you to put in an array of characters PATINDEX('[._]',ColumnName) returns a 0 or a value of the first occurance of an illegal character found in a certain value. Hope this helps.

Upvotes: 0

Aasmund Eldhuset

Reputation: 37950

If this is a query that will be run repeatedly, you are probably better off creating an index for it. The syntax escapes me at the moment, but you could probably create a computed column (edit: probably a PERSISTED computed column) which is 1 if columnToBeSearched or otherColumnToBeSearched contain illegal characters, and 0 otherwise. Create an index on that column and simply select all rows where the column is 1. This assumes that the set of illegal characters is fixed for that database installation (I assume that that's what you mean by "specified by the customer"). If, on the other hand, each query might specify a different set of illegal characters, this won't work.

By the way, if you don't mind the risk of reading uncommitted rows, you can run the query in a transaction with the the isolation level READ UNCOMMITTED, so that you won't block other transactions.

Upvotes: 3

Paolo Falabella

Reputation: 25844

You can try to partition your data horizontally and "split" your query in a number of smaller queries. For instance you can do

SELECT x FROM users 
WHERE users.ID BETWEEN 1 AND 5000 
AND -- your filters on columnToBeSearched

putting your results back together in one list may be a little inconvenient, but if it's a report you're only extracting once (or once in a while) it may be feasible. I'm assuming ID is the primary key of users or a column that has a index defined, which means SQL should be able to create an efficient execution plan, where it evaluates users.ID BETWEEN 1 AND 5000 (fast) before trying to check the filters (which may be slow).

Upvotes: 0

Best SQL query for list of records containing certain characters?

Answers (4)

Related Questions