Reputation: 519
Okay, I thought my head hurt with regular regex, but I can't seem to find what I'm looking for with regexp in mysql.
I'm trying to look for situations in news articles where a Textile-formatted url has not ended with a slash so:
"Catherine Zeta-Jones":/cr/catherinezeta-jones/ visited stack overflow is ok but "Catherine Zeta-Jones":/cr/catherinezeta-jones visited stack overflow is not.
[just used Catherine as an example because I'm assuming an alpha search wouldn't catch the hyphen]
One of these days I'll have to do that goat sacrifice so I can gain the proper knowledge of regex.
Thanks everyone!
Upvotes: 0
Views: 1341
Reputation: 36622
I think you would end up with the following regex: "((?:[^"]|\\")*)":(\S*/)(?=\s)
. Breaking this down:
"
matches a literal quote.[^"]
, a non-quote character, or\\"
, an escaped quote."
matches a literal quote.:
matches a literal colon.\S*
matches zero or more non-whitespace characters/
matches a literal slash(?=\s)
is a lookahead, matching if the next character is any whitespace character.After this, the first capture group (often $1
) will contain the link text, and $2
will contain the link URL. It will only match links of the format you want. If you want to match all links, just remove the /
, and then remove all the URLs that don't end with one.
Edit: As far as I can tell, MySQL doesn't have \s
, \S
, (?:...
), or (?=...)
, apparently. Replacing \s
and \S
is easy: just use [[:blank:]]
and [^[:blank:]]
. Replacing the (?:...)
s is also easy, since I imagine that in this context, you don't care which groups capture what: just replace all the (?:...)
s with (...)
. Replacing the lookahead is tricky in general, but should be possible in this context: as far as I can tell, you don't actually care if the space at the end is included in the match. Thus, you can just match what you want to look ahead for instead of simply looking ahead for it(or an end-of-string, $
, which I left out of my first answer). To refine things a little, here are two regexen:
"(([^"]|\\")*)":([^[:space:]]*/)([[:space:]]|$)
, which matches Textile URLs which do end in a slash; or"(([^"]|\\")*)":([^[:space:]]*[^/])([[:space:]]|$)
, which matches Textile URLs which don't end in a slash.Edit: What I had last time almost worked. However, consider something like "text":/url/
. The [^[:space:]]*
would grab /url/
, the [^/]
would grab the first space, and the ([[:space:]]|$)
would grab the second, thus telling you that it didn't match. To fix that, we tell it that the last character of the URL must also be a non-space character. Doing that, and removing some redundant parentheses, gives us
"([^"]|\\")*":[^[:space:]]*/([[:space:]]|$)
, which matches Textile URLs which do end in a slash; or"([^"]|\\")*":[^[:space:]]*[^/[:space:]]([[:space:]]|$)
, which matches Textile URLs which don't end in a slash.This is almost good enough. However, it isn't quite. Although punctuation such as ,
is technically legal in a URL, Textile disallows it. However, its rules for when URLs end are rather vague: "common punctuation … can reside at the end of a URL." What's more, it's also possible to enclose links in square brackets, which this regex also won't catch. And that doesn't even begin to deal with link lookup tables.
Modifying things to deal with the ,
case is easy: everywhere you see [:space:]
, replace it with [:space:],
, and include whatever else you need after the ,
. This is because within the above regexen, [:space:]
stands for "illegal URL character". Modifying things to deal with the bracket case is harder, but is probably best accomplished by treating ]
like a comma. And dealing with link lookup tables is not possible with this regex. It may be possible with a regex along the lines of (untested) ^[[:space:]]*\[[^\]]*\].*[^/]$
, but I'm nearly positive that that will miss corner cases (either getting too much, too little, or both), and it probably has bugs of its own. You certainly can't know whether a link that's defined is ever used simply using regexen.
This last complication, then, opens up another question: what precisely are you doing with this? I know right now you're doing a SELECT ... WHERE ...
, but what's that for? The reason I ask is that this task is no longer really suited for regexen. You ought to be using a parser, or something along those lines. And depending upon why you're doing this, that may be possible.
Upvotes: 1