Paul Barclay

Reputation: 499

Regular expression with negative look aheads

I am trying to construct a regular expression to remove links from content unless they meet one of two conditions.

<a.*?href=["'](http[s]?:\/\/(.*?)\.link\.com)?\/(?!m\/).*?<\/a>

This matches any link to link.com that does not have m/ immediately after the domain section. I want to change it slightly so that it also doesn't match URLs that point to PDF files, regardless of whether the URL contains m/. I came up with:

<a.*?href=["'](http[s]?:\/\/(.*?)\.brodies\.com)?\/(?!m\/).*?\.(?!pdf)["'].*?<\/a>

Which is oh so very close, except that now it only matches if the URL has a "." at the end. I can see why it's doing that, but I can't seem to make the "." optional, because that causes the non-greedy pattern before the "." to keep going until it hits the ["'].

Any help solving this would be appreciated.

Thanks, Paul

Upvotes: 0

Views: 68

Answers (2)

Andrew Cheong

Reputation: 30273

First, see "RegEx match open tags except XHTML self-contained tags".

That said (since it probably will not deter you), here is a slightly better-constrained version of what you're trying to do, with the caveat that this is still not good enough!

<a[^>]+?href\s*=\s*["'](https?:\/\/[^"']*?\.link\.com)?\/(?!m\/)[^"']*?\.(?!pdf)[^"']*?["'][^>]*?>.*?<\/a>

You can see a running example of this regex at: http://rubular.com/r/obkKrKpB8B.

Your problem was actually just that you were looking for a quote character immediately after the dot, here: \.(?!pdf)["'].
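To make that concrete, here is a small Python sketch that isolates just that tail of the pattern. The sample href strings are made up for illustration, and Python's re module is assumed to treat this fragment the same way as whichever engine the question is using.

```python
import re

# Only the tail of the question's pattern: a literal dot that is not
# followed by "pdf", and then immediately a quote character.
tail = re.compile(r"""\.(?!pdf)["']""")

# Hypothetical sample attributes, not taken from the question.
normal_url = 'href="/about.html"'   # the dot is followed by "h", not by a quote
dot_at_end = 'href="/about."'       # the dot sits right before the closing quote

print(tail.search(normal_url) is not None)
print(tail.search(dot_at_end) is not None)
```

The lookahead consumes nothing, so the quote has to be the very next character after the dot, which is why only URLs ending in "." get through.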

Upvotes: 1

Qtax

Reputation: 33908

You probably want to use (?<!\.pdf)["'] instead of \.(?!pdf)["'].
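A rough sketch of how that lookbehind could slot into the whole pattern, written in Python. Note the assumptions: the [^"'] and [^>] character classes are borrowed from the other answer rather than from the question, the function name remove_links_regex and the sample links are hypothetical, and the match is case-sensitive (so ".PDF" would not be protected).

```python
import re

# Assumed combination: constrained character classes plus the
# (?<!\.pdf) lookbehind placed right before the closing quote.
LINK = re.compile(
    r"""<a[^>]+?href\s*=\s*["']"""        # opening tag up to the href value
    r"""(https?://[^"']*?\.link\.com)?"""  # optional absolute link.com prefix
    r"""/(?!m/)"""                         # path must not start with m/
    r"""[^"']*(?<!\.pdf)["']"""            # rest of the URL, not ending in .pdf
    r"""[^>]*>.*?</a>"""                   # rest of the tag, link text, close
)


def remove_links_regex(html):
    """Delete every <a>...</a> element that the pattern matches."""
    return LINK.sub("", html)


sample = (
    '<a href="http://www.link.com/page">a</a>'
    '<a href="http://www.link.com/m/page">b</a>'
    '<a href="http://www.link.com/docs/file.pdf">c</a>'
)
print(remove_links_regex(sample))
```

With the lookbehind there is no need for a literal dot in the URL at all, so links without a file extension are matched too, while the m/ and .pdf links are left alone.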

But note that this expression has several issues; the best way to solve them is to use a proper HTML parser.
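For what the parser route might look like, here is a minimal sketch using Python's standard html.parser and urllib.parse modules. The question doesn't say which language is in use, so Python is an assumption, and should_remove, LinkStripper and remove_links_parsed are hypothetical names. The rule in should_remove is my reading of the question: remove links to *.link.com (or site-relative links) unless the path starts with /m/ or ends in .pdf.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def should_remove(href):
    """Assumed rule: on-site link, path not under /m/, not a PDF."""
    parts = urlparse(href)
    host = parts.hostname or ""
    on_site = host == "" or host.endswith(".link.com")
    path = parts.path
    return (on_site and path.startswith("/")
            and not path.startswith("/m/")
            and not path.lower().endswith(".pdf"))


class LinkStripper(HTMLParser):
    """Re-emits the input, dropping <a> elements that should_remove flags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skipping = False  # True while inside a removed <a> element

    def handle_starttag(self, tag, attrs):
        if self.skipping:
            return
        if tag == "a" and should_remove(dict(attrs).get("href") or ""):
            self.skipping = True
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skipping:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skipping:
            if tag == "a":
                self.skipping = False
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skipping:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skipping:
            self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        if not self.skipping:
            self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        if not self.skipping:
            self.out.append("<!%s>" % decl)


def remove_links_parsed(html):
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the href is parsed as a URL, a link such as /file.pdf?v=2 is still recognised as a PDF, and attribute order or extra whitespace inside the tag no longer matters. This sketch drops the link text along with the tag, mirroring a regex replace with an empty string; keeping the text would only need the handle_data check relaxed.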

Upvotes: 1
