Reputation: 1259
I am trying to write a regex that will match URLs inside strings of text that may be HTML-encoded. I am having a considerable amount of trouble with lookaround, though. I need something that would correctly match both links in the string below:
some text &quot;http://www.notarealwebsite.com/?q=asdf&searchOrder=1&quot; "http://www.notarealwebsite.com" some other text
A verbose description of what I want would be: "http://" followed by any number of characters that are not spaces, quotes, or the string "&quot;" (I don't care about accepting other non-URL-safe characters as delimiters).
I have tried a few regexes using lookahead to check for &'s followed by q's followed by u's and so on, but as soon as I put one into the [^...] negation it just completely breaks down and evaluates more like: "http:// followed by any number of characters that are not spaces, quotes, ampersands, q's, u's, o's, t's, or semicolons" which is obviously not what I am looking for.
This will correctly match the &'s at the beginning of the &quot;:
&(?=q(?=u(?=o(?=t(?=;)))))
But this does not work:
http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*
I know just enough about regexes to get into trouble, and that includes not knowing why this won't work the way I want it to. I understand to some extent positive and negative lookaround, but I don't understand why it breaks down inside the [^...]. Is it possible to do this with regexes? Or am I wasting my time trying to make it work?
Upvotes: 0
Views: 93
Reputation: 1861
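If your regex implementation supports it, use a positive lookahead and a backreference with a non-greedy expression in the body. Group 1 captures whichever delimiter opens the URL, and the lookahead (?=\1) stops the match at the next occurrence of that same delimiter.
Here is one with your conditions: (["\s]|&quot;)(http://.*?)(?=\1)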
For example, in Python:
import re
# Group 1 is the opening delimiter (quote, whitespace, or the literal text &quot;)
p = re.compile(r'(["\s]|&quot;)(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more &quot;{0}&quot; test greed&quot;'
data = formatstr.format(url)
for m in p.finditer(data):
    print("Found:", m.group(2))
Produces:
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Or in Java:
@Test
public void testRegex() {
    // Group 1 is the opening delimiter (quote, whitespace, or the literal text &quot;)
    Pattern p = Pattern.compile("([\"\\s]|&quot;)(https?://.*?)(?=\\1)",
            Pattern.CASE_INSENSITIVE);
    final String URL = "http://test.url/here.php?var1=val&var2=val2";
    final String INPUT = "some text " + URL + " more text + \"" + URL +
            "\" more then &quot;" + URL + "&quot; testing greed &quot;";
    Matcher m = p.matcher(INPUT);
    while (m.find()) {
        System.out.println("Found: " + m.group(2));
    }
}
Produces the same output.
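As for why the attempt in the question breaks down: inside [...] the characters ( ? = ) lose their special meaning and are treated as literals, so the lookahead turns into a plain list of excluded characters. The rough Python 3 sketch below shows that effect, then tries a different technique from the one above: a negative lookahead placed in front of each character rather than inside the class. The names broken and tempered are made up for illustration, and the sample text assumes the first URL is wrapped in the literal text &quot;.

```python
import re

# The pattern from the question: everything between [^ and ] is a literal
# character, so q, u, o, t, (, ?, = and ) are all excluded individually.
broken = re.compile(r'http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*')

# Alternative sketch: before consuming each character, check that the
# literal text &quot; does not start here, then accept any character
# that is not whitespace or a double quote.
tempered = re.compile(r'http://(?:(?!&quot;)[^\s"])*')

text = ('some text &quot;http://www.notarealwebsite.com/?q=asdf&searchOrder=1&quot; '
        '"http://www.notarealwebsite.com" some other text')

broken_matches = broken.findall(text)
tempered_matches = tempered.findall(text)

print(broken_matches)    # both matches stop at the first "o" in the host name
print(tempered_matches)  # both full URLs, including the lone & in the query
```

Unlike the backreference version, this one does not need to know which delimiter opened the URL, but it will also stop at a quote or space that appears inside a URL, so treat it as a starting point only.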
Upvotes: 1