Charles Belov
Charles Belov

Reputation: 663

How to match on URLs outside of tags but not within tags in SQL-friendly regex

I'm trying to detect use of URLs that would be visible on a web page, while ignoring URLs inside a tag. This will be run against specific database fields in an SQL query, so lookahead is, to my understanding, not available to me. The fields to be searched contain HTML page fragments.

The problem I'm encountering is that URLs inside tags are matching when they're not supposed to.

Details follow:

I am running the following regex:

[^=\"\/]([Hh][tT][tT][pP][sS]?:\/\/)?([\w]+\.)*[\w]+\.([A-za-z]+)

against the following strings. I expect the 5 strings beginning with a + to not match (be considered good) and the 7 strings beginning with a - to match (be considered bad/reportable):

(Update: the + and - are just to show which strings I expect to be considered good or bad. They will not neccessarily be present in the data to be examined and are not relevant to the question.)

(Second update: The whole expression is wrong to begin with. SQL apparently requires POSIX, so I replaced the expression with:

[^.=..".](https?[.:.][./.][...])?(:alnum:)*[:alnum:]+[.period.][:alpha:][:alpha:]+

but it's still not working. I don't have the details for what's matching and what isn't, but it's still matching on code inside the HTML tags.)

+<a href="http://www.sfmta.com">text is okay</a>    
+<a href="http://WWW.SFMTA.COM" title="Commercial">com</a>    
+text is okay    
+text is...okay    
+000.000.0000    
-text is okay com    
-text is bad HTTP://WWW.SFMUNI.COM    
-<a href="http://www.sfmta.com">http://www.sfmuni.com</a>    
-<a href="http://www.sfmta.com">www.sfmuni.com</a>    
-<a href="http://www.sfmta.com">sfmuni.com</a>    
-text is bad www.sfmuni.com    
-text is bad sfmuni.com    

For some reason, all of the URLs within tags are matching when none of them are supposed to match.

From [PHP Live Regex] [1], results (interspersed with the original strings):

1.

+<a href="http://www.sfmta.com">text is okay</a>    

Expected no match, but matches

Array
(
    [0] => www.sfmta.com
    [1] => 
    [2] => ww.
    [3] => com
)

2.

+<a href="http://WWW.SFMTA.COM" title="Commercial">com</a>    

Expected no match, but matches

Array
(
    [0] => WWW.SFMTA.COM
    [1] => 
    [2] => WW.
    [3] => COM
)

3.

+text is okay    

Does not match, as expected

Array
(
)

4.

+text is...okay    

Does not match, as expected

Array
(
)

5.

+000.000.0000    

Does not match, as expected

Array

(
)

6.

-text is okay com    

Does not match, as expected

Array
(
)

7.

-text is bad HTTP://WWW.SFMUNI.COM    

Matches, as expected

Array
(
    [0] =>  HTTP://WWW.SFMUNI.COM
    [1] => HTTP://
    [2] => WWW.
    [3] => COM
)

8.

-<a href="http://www.sfmta.com">http://www.sfmuni.com</a>    

Matches www.sfmta.com, expected match on http://www.sfmuni.com or www.sfmuni.com

Array
(
    [0] => www.sfmta.com
    [1] => 
    [2] => ww.
    [3] => com
)

9.

-<a href="http://www.sfmta.com">www.sfmuni.com</a>    

Matches www.sfmta.com, expected match on www.sfmuni.com

Array
(
    [0] => www.sfmta.com
    [1] => 
    [2] => ww.
    [3] => com
)

10.

-<a href="http://www.sfmta.com">sfmuni.com</a>    

Matches www.sfmta.com, expected match on sfmuni.com

Array
(
    [0] => www.sfmta.com
    [1] => 
    [2] => ww.
    [3] => com
)

11.

-text is bad www.sfmuni.com    

Matches, as expected

Array
(
    [0] =>  www.sfmuni.com
    [1] => 
    [2] => www.
    [3] => com
)

12.

-text is bad sfmuni.com    

Matches, as expected

Array
(
    [0] =>  sfmuni.com
    [1] => 
    [2] => 
    [3] => com
)

How do I get the URLs within the tags to not match? There may be other attributes in addition to href or src in any given tag that has an URL.

I also realize that an URL within text that begins at the very beginning of the field, or if someone types a =, " or / just before an URL within text, these will be missed. These are possible but unlikely. But if someone can catch those in a reasonable-length regex, that would be a bonus. It's more important for me to not match on URLs within tags, however.

  [1]: http://www.phpliveregex.com/

Upvotes: 2

Views: 122

Answers (1)

Adam Silenko
Adam Silenko

Reputation: 3108

I know that's old post, but perhaps someone is looking for example of regexp in MySQL:

SELECT str
, str REGEXP '^([.<.].*[.>.])?(.* )?(https?[.:.][./.]{2})?([a-zA-Z0-9]+[.])+[a-zA-Z]{2,6}( .*)?([.<.].*[.>.])?$' address_in_with_text
, str REGEXP '^([.<.].*[.>.])?(https?[.:.][./.]{2})?([a-zA-Z0-9]+[.])+[a-zA-Z]{2,6}([.<.].*[.>.])?$' address_in_tag
, str REGEXP '^.*([a-zA-Z0-9]+[.])+[a-zA-Z]{2,6}.*$' address_in
, str REGEXP '^.*(https?[.:.][./.]{2}).*' http_in
, str REGEXP '^([.<.].*[.>.]).*$' tag_at_begin
, str REGEXP '^.*([.<.].*[.>.])$' tag_at_end
FROM strings;

example test data: create table strings ( str varchar(255) not null );

insert into strings values ('<a href="http://www.sfmta.com">text is okay</a>'),
('<a href="http://WWW.SFMTA.COM" title="Commercial">com</a>'),    
('text is okay'),    
('text is...okay'),    
('000.000.0000'),    
('text is okay com'),    
('text is bad HTTP://WWW.SFMUNI.COM'),    
('<a href="http://www.sfmta.com">http://www.sfmuni.com</a>'),    
('<a href="http://www.sfmta.com">www.sfmuni.com</a>'),    
('<a href="http://www.sfmta.com">sfmuni.com</a>'),    
('text is bad www.sfmuni.com'),    
('text is bad sfmuni.com'),
('text is bad https://www.sfmuni.com'),   
('<a href="http://WWW.SFMTA.COM" title="Commercial">https://com</a>')   
;

Output:

|                                                               str | address_in_with_text | address_in_tag | address_in | http_in | tag_at_begin | tag_at_end |
|-------------------------------------------------------------------|----------------------|----------------|------------|---------|--------------|------------|
|                   <a href="http://www.sfmta.com">text is okay</a> |                    0 |              0 |          1 |       1 |            1 |          1 |
|         <a href="http://WWW.SFMTA.COM" title="Commercial">com</a> |                    0 |              0 |          1 |       1 |            1 |          1 |
|                                                      text is okay |                    0 |              0 |          0 |       0 |            0 |          0 |
|                                                    text is...okay |                    0 |              0 |          0 |       0 |            0 |          0 |
|                                                      000.000.0000 |                    0 |              0 |          0 |       0 |            0 |          0 |
|                                                  text is okay com |                    0 |              0 |          0 |       0 |            0 |          0 |
|                                 text is bad HTTP://WWW.SFMUNI.COM |                    1 |              0 |          1 |       1 |            0 |          0 |
|          <a href="http://www.sfmta.com">http://www.sfmuni.com</a> |                    1 |              1 |          1 |       1 |            1 |          1 |
|                 <a href="http://www.sfmta.com">www.sfmuni.com</a> |                    1 |              1 |          1 |       1 |            1 |          1 |
|                     <a href="http://www.sfmta.com">sfmuni.com</a> |                    1 |              1 |          1 |       1 |            1 |          1 |
|                                        text is bad www.sfmuni.com |                    1 |              0 |          1 |       0 |            0 |          0 |
|                                            text is bad sfmuni.com |                    1 |              0 |          1 |       0 |            0 |          0 |
|                                text is bad https://www.sfmuni.com |                    1 |              0 |          1 |       1 |            0 |          0 |
| <a href="http://WWW.SFMTA.COM" title="Commercial">https://com</a> |                    0 |              0 |          1 |       1 |            1 |          1 |

Upvotes: 1

Related Questions