Ryan Elkins

Reputation: 5797

Regex: Matching HTML elements that do not contain specific text

I need to strip out of some HTML any movies that are not hosted by YouTube. Originally the request was to strip out any movies at all, for which

<object.*object>

worked pretty well. Now I basically need to do the same thing, but only if the stuff in the object tags is not hosted on YouTube. I need a regex pattern that will match a string that starts with <object and ends with object> but does not contain the word "youtube". There are probably more things I would need to consider if I needed this to work with all possible scenarios, but the above should do the trick for the job at hand.

I've been playing with negative lookaheads but have not yet been able to get it to work. Here are some of the things I have tried:

<object.*(?!youtube).*object> - matches all object tags since * is greedy

<object.+?(?!youtube).+?object>

<object(?!youtube)*object>

plus plenty of others that just further reinforce that I'm stabbing wildly in the dark at this one.

This is in Java 1.6

Upvotes: 2

Views: 1201

Answers (2)

Bart Kiers

Reputation: 170258

Try:

(?s)<object((?!youtube).)*?object>
  1. (?s) will cause the DOT meta character to match any character (including line breaks)
  2. <object and object> must be clear
  3. (?!youtube). will first check that "youtube" does not start at the current position, and if that is the case, it will match any single character
  4. ((?!youtube).)*? will match [3] zero or more times, reluctantly ("un-greedy")
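
As a rough illustration of the steps above, here is a minimal sketch of how the pattern might be applied in Java. The class name ObjectTagFilter, the helper stripNonYoutubeObjects, and the sample HTML are made up for this example and are not part of the question:

```java
import java.util.regex.Pattern;

public class ObjectTagFilter {
    // (?s) lets '.' match line breaks as well.
    // ((?!youtube).)*? consumes one character at a time, reluctantly,
    // and only while "youtube" does not start at the current position.
    private static final Pattern NON_YOUTUBE_OBJECT =
            Pattern.compile("(?s)<object((?!youtube).)*?object>");

    // Hypothetical helper: removes every <object ... object> span
    // that does not contain the text "youtube".
    public static String stripNonYoutubeObjects(String html) {
        return NON_YOUTUBE_OBJECT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input, invented for illustration only.
        String html =
              "<p>a</p>"
            + "<object data=\"http://youtube.com/v/abc\"></object>"
            + "<p>b</p>"
            + "<object data=\"http://example.com/movie.swf\">\n</object>";

        // Prints: <p>a</p><object data="http://youtube.com/v/abc"></object><p>b</p>
        System.out.println(stripNonYoutubeObjects(html));
    }
}
```

The match is case-sensitive as written, so "YouTube" in mixed case would not be recognised; adding the i flag, as in (?si), would be one way to relax that.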

Be aware that a regex can easily go wrong on real-world HTML. For a more robust solution, use an (X)HTML parser to iterate over all object tags and check whether "youtube" appears in the attribute or inner HTML where you expect it to be.

Upvotes: 6

Igor Artamonov

Reputation: 35951

How about making it not so greedy? :) <object.*?(?!youtube).*?object>

Upvotes: 0
