mlnyc

Reputation: 2716

Find all rows in SQL Server table containing HTML tags

Suppose I have the following table:

CREATE TABLE test (
    head NVARCHAR(100),
    body NTEXT
)   

In this table, I want to find all rows where the body column contains HTML tags. Any HTML tag counts, and the solution should not miss anything. In other words, false positives are preferable to false negatives.

Could anyone help me come up with a viable solution? If this were application code, I'd use an HTML parser or, failing that, a regex. However, I don't have regex capability in SQL Server out of the box. Ideally the solution will be compatible with 2005 onwards, but being compatible with 2008 onwards will work as well.

Any help will be much appreciated.

Also, the body column above is NTEXT. Would your answer change if the body column's type were NVARCHAR(MAX)?

Thanks!

Upvotes: 2

Views: 5753

Answers (3)

Guy Schalnat

Reputation: 1717

As I was interested in not having as many false positives, I took this a little bit further. According to the HTML5 standard on start tags, a tag starts with a '<', and after that only alphanumeric characters are allowed in the tag name. Since LIKE has no way to say "one or more", we'll have to settle for checking the first character or two after the '<', but that can at least improve the quality of the results slightly.

So here is one character after the '<', making sure the next character is alpha-numeric (interestingly, I could not find anywhere in the spec that forbids it starting with a number):

like '%<[A-Za-z0-9]%'

If you want to look two characters deep, the second character could also be '/', '>', or whitespace. The whitespace characters can't be typed directly into a string literal easily, so I believe you have to build the pattern in a variable:

declare @html_pattern varchar(50)
set @html_pattern = '%<[A-Za-z0-9][A-Za-z0-9/> ' + 
    char(9) + char(10) + char(11) + char(13) + ']%'

like @html_pattern
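To show how the two-character pattern above fits into a complete query, here is a minimal sketch. It assumes a throwaway temp table named #test (hypothetical, mirroring the question's schema) with three made-up sample rows. It also builds the pattern as NVARCHAR with NCHAR() to match the Unicode column, which is my choice rather than part of the original snippet.

```sql
-- Hypothetical temp table with the same columns as the question's table.
CREATE TABLE #test (
    head NVARCHAR(100),
    body NTEXT
);

-- Made-up sample rows; one INSERT per row so this also runs on SQL Server 2005.
INSERT INTO #test (head, body) VALUES (N'plain',  N'Nothing to see here, 3 < 5');
INSERT INTO #test (head, body) VALUES (N'markup', N'<p>Hello</p>');
INSERT INTO #test (head, body) VALUES (N'break',  N'line one<br/>line two');

-- '<', then one alphanumeric, then an alphanumeric, '/', '>', space,
-- tab, line feed, vertical tab, or carriage return.
DECLARE @html_pattern NVARCHAR(50);
SET @html_pattern = N'%<[A-Za-z0-9][A-Za-z0-9/> '
    + NCHAR(9) + NCHAR(10) + NCHAR(11) + NCHAR(13) + N']%';

SELECT head
FROM #test
WHERE body LIKE @html_pattern;

DROP TABLE #test;
```

With these sample rows, the 'markup' and 'break' rows should come back, while the 'plain' row should not, because its '<' is followed by a space.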

On a slightly different but related idea, you probably should also look for html comments:

like '%<!--%'
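As a rough sketch, the tag check and the comment check can be combined in a single WHERE clause. This assumes the test table from the question and uses the simpler one-character tag pattern. Note the trailing '%' on the comment pattern, which is needed so that it matches a comment anywhere in the text rather than only at the very end.

```sql
-- Rows whose body contains something that looks like a start tag
-- or the opening of an HTML comment.
SELECT head
FROM test
WHERE body LIKE '%<[A-Za-z0-9]%'
   OR body LIKE '%<!--%';
```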

Having said all that, I admit that I've never actually seen an html tag start with a number, and so I am actually using:

declare @html_pattern varchar(50)
set @html_pattern = '%<[A-Za-z][A-Za-z0-9/> ' + 
    char(9) + char(10) + char(11) + char(13) + ']%'

I just can't recommend it officially here without someone jumping all over me. Also, for the record, the simpler

like '%<[A-Za-z]%'

worked just as well for my data set (which, admittedly, isn't that big, and the only false positives were in the form >[0-9]). I think you could even get away with:

like '%<[a-z]%'

because almost nobody writes tag names with capital letters (notice that I said "almost"), especially if your data does have false positives with capital letters, but I'd rather not take that chance.

As a side note, XML is going to have a different pattern.

And that is as far as I want to go without regular expressions.

Upvotes: 1

JohnnyMontana

Reputation: 94

select ... from test where body like '%<%' and body like '%>%'

Upvotes: 0

brazilianldsjaguar

Reputation: 1449

I agree with the comments to the question, that using a simple LIKE '%<%' would be sufficient. SQL Server does have regex-like processing using the PATINDEX function. This answer has a few examples, as well as the MSDN link about it.

PATINDEX returns a 1-based position, and 0 means the pattern was not found (just to be aware).
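For illustration, a PATINDEX version of the search might look like the sketch below. It assumes the test table from the question and borrows the '%<[A-Za-z]%' pattern from another answer on this page. The column alias first_tag_position is made up for this example.

```sql
-- PATINDEX takes the same wildcard syntax as LIKE and returns the
-- position of the first match, so comparing against 0 filters the rows.
SELECT head,
       PATINDEX('%<[A-Za-z]%', body) AS first_tag_position
FROM test
WHERE PATINDEX('%<[A-Za-z]%', body) > 0;
```

The filter selects the same rows as the equivalent LIKE. The gain is that you also learn where the first match starts.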

Upvotes: 1
