Reputation: 4323
I have tried the below expressions.
(http:\/\/.*?)['\"\< \>]
(http:\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;\"]*[-a-zA-Z0-9+&@#\/%=~_|\"])
The first one works, but every match includes one extra trailing character after the URL.
E.g.:
http://domain.com/path.html"
http://domain.com/path.html<
Notice the trailing " and <. I don't want them included with the URLs.
Upvotes: 0
Views: 131
Reputation: 383866
You can use a lookahead instead of making ['\"\< >] part of your match, i.e.:
(http:\/\/.*?)(?=['\"\< >])
Generally speaking, whereas ab matches ab, a(?=b) matches a (if it's followed by b).
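To make the lookahead concrete, here is a minimal sketch in Python (the question does not say which engine is in use, so Python's `re` module is an assumption, and the sample text is made up from the URLs in the question). The terminator is asserted but not consumed, so it never ends up in the match:

```python
import re

# Hypothetical sample text; the URLs echo the examples in the question.
text = 'see <a href="http://domain.com/path.html">x</a> and http://domain.com/other.html<br>'

# (?=...) only asserts that one of the terminators follows; it is not part of the match.
pattern = re.compile(r"""http://.*?(?=['"< >])""")

urls = pattern.findall(text)
print(urls)
```

Python's regex syntax does not need the forward slashes escaped, which is why `\/` is written as `/` here.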
Lookarounds are not supported by all flavors. More widely supported are capturing groups.
Generally speaking, whereas (a)b still matches ab, it also captures a in group 1.
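Applied to the first expression from the question, this means the pattern can stay as it is: the whole match still carries the terminator, but group 1 holds only the URL. A small sketch, again assuming Python's `re` module and a made-up input string:

```python
import re

# Hypothetical input containing the two problem cases from the question.
text = 'href="http://domain.com/path.html" http://domain.com/other.html<'

# Same idea as the question's first expression: the terminator is matched
# but sits outside the capturing group.
pattern = re.compile(r"""(http://.*?)['"< >]""")

matches = list(pattern.finditer(text))
whole = [m.group(0) for m in matches]     # whole match, terminator included
captured = [m.group(1) for m in matches]  # group 1, URL only

print(whole)
print(captured)
```

Reading `group(1)` instead of `group(0)` is the only change needed on the calling side.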
Depending on the need, using a negated character class is often much better than using a reluctant .*? (followed by a lookahead to assert the terminator pattern in this case).
Let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.
We use the following as input:
eeAiiZooAuuZZeeeZZfff
We use 3 different patterns:
A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com); group 1 captures iiZooAuuZZeee
A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com); group 1 captures iiZooAuu
A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com); group 1 captures uu
Here's a visual representation of what they matched:
             ___n
            /   \              n = negated character class
    eeAiiZooAuuZZeeeZZfff      r = reluctant
      \_________/r   /         g = greedy
       \____________/g
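The three results can be reproduced with a short script. This is a sketch in Python rather than the code behind the ideone.com links, and the dictionary labels are just names chosen for this illustration:

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

# Labels are illustrative; the patterns are the three discussed above.
patterns = {
    "greedy": r"A(.*)ZZ",
    "reluctant": r"A(.*?)ZZ",
    "negated class": r"A([^Z]*)ZZ",
}

results = {}
for name, pattern in patterns.items():
    m = re.search(pattern, text)
    # group(0) is the whole match, group(1) the part between A and ZZ.
    results[name] = (m.group(0), m.group(1))
    print(f"{name:14} match={m.group(0)!r} group1={m.group(1)!r}")
```

The negated class skips the first A entirely: after `ii` it meets a single Z that is not followed by another Z, so the engine gives up on that starting point and tries again from the second A.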
Upvotes: 7
Reputation: 112200
Hmmm, I'd probably do this simply by saying "keep going until you get an unwanted character", like so:
http://[^'"< >]*
Escaped version (based on Q - not sure what engine this is):
http:\/\/[^'\"\< >]*
However the lookahead solution by polygenelubricants is a more flexible way, if you might have some of those characters in the URL (but not at the end).
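A quick sketch of the negated-class version, assuming Python's `re` module and made-up input. One practical difference worth noting: it does not require a terminator at all, so a URL at the very end of the text is still found, whereas the lookahead form needs one of the listed characters to follow:

```python
import re

# Hypothetical input: one URL inside quotes, one at the very end of the text.
text = 'link="http://domain.com/path.html" then http://domain.com/last.html'

# "Keep going until you hit an unwanted character."
negated = re.compile(r"""http://[^'"< >]*""")
# Lookahead form from the other answers, for comparison.
lookahead = re.compile(r"""http://.*?(?=['"< >])""")

negated_urls = negated.findall(text)
lookahead_urls = lookahead.findall(text)

print(negated_urls)
print(lookahead_urls)
```

Here the lookahead pattern misses the final URL because nothing follows it; adding `$` as an alternative inside the lookahead would be one way to cover that case.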
Upvotes: 1
Reputation: 3620
You need to use "(?=regex)" (a lookahead), which checks that a particular pattern follows but doesn't include it in the result:
http:\/\/.*?(?=['\"\< >])
Upvotes: 1