Reputation: 5797
I need to strip out any movies that are not hosted by YouTube from html. Originally the request was to strip out any movies at all, for which
<object.*object>
worked pretty well. Now I basically need to do the same thing, but only if the stuff in the object tags is not hosted on YouTube. I need a regex pattern that will match a string that starts with <object but does not contain the word "youtube". There are probably more things I would need to consider if I needed this to work in all possible scenarios, but the above should do the trick for the job at hand.
I've been playing with negative lookaheads but have not yet been able to get it to work. Here are some of the things I have tried:
<object.*(?!youtube).*object>
- matches all object tags since * is greedy
<object.+?(?!youtube).+?object>
<object(?!youtube)*object>
plus plenty of others that just further reinforce that I'm stabbing wildly in the dark at this one.
This is in Java 1.6
Upvotes: 2
Views: 1201
Reputation: 170258
Try:
(?s)<object((?!youtube).)*?object>
1. (?s) will cause the DOT meta character to match any character (including line breaks)
2. <object and object> must be clear
3. (?!youtube). will first check that no youtube can be "seen" ahead, and if that is the case, the regex will match any character
4. ((?!youtube).)*? will match [3] zero or more times, reluctantly ("un-greedy")

Be aware that with regex, it is possible that things can go wrong. For a more robust solution, use an (X)HTML parser to iterate over all object tags and check whether "youtube" appears in the attribute or inner HTML where you expect it to be.
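To make the pattern above concrete, here is a minimal sketch of how it might be applied in Java. The class name, the strip helper and the sample HTML are hypothetical, made up for illustration. The match is case-sensitive, so it assumes the URL contains "youtube" in lowercase.

```java
import java.util.regex.Pattern;

class StripNonYoutubeObjects {
    // (?s) lets '.' match line breaks; ((?!youtube).)*? consumes one character
    // at a time, and only at positions where "youtube" does not start.
    private static final Pattern NON_YOUTUBE_OBJECT =
            Pattern.compile("(?s)<object((?!youtube).)*?object>");

    // Removes every <object ... object> span that does not contain "youtube".
    static String strip(String html) {
        return NON_YOUTUBE_OBJECT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample input: one non-YouTube object, one YouTube object.
        String html = "<p>a</p>"
                + "<object data=\"http://example.com/movie.swf\"></object>"
                + "<p>b</p>"
                + "<object data=\"http://www.youtube.com/v/abc\"></object>";
        // Only the example.com object is removed; the YouTube one is kept.
        System.out.println(strip(html));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which matters if many documents are processed.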
Upvotes: 6
Reputation: 35951
How about making it not so greedy? :)
<object.*?(?!youtube).*?object>
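A quick way to try a pattern like this is to run it against a sample YouTube object and see whether it still finds a match. The sketch below uses a hypothetical class name, helper and sample string. With the reluctant quantifiers the lookahead is tested at only one position, so the engine can satisfy it at a spot where "youtube" does not start and then carry on to object>. A lookahead that is re-checked before every character does not have that escape.

```java
import java.util.regex.Pattern;

class LazyLookaheadCheck {
    // Returns true if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample: an object tag that IS hosted on YouTube.
        String youtube = "<object data=\"http://www.youtube.com/v/abc\"></object>";

        // Single lookahead between two reluctant .*? runs: still matches.
        System.out.println(finds("<object.*?(?!youtube).*?object>", youtube));    // true

        // Lookahead repeated before every character: no match.
        System.out.println(finds("(?s)<object((?!youtube).)*?object>", youtube)); // false
    }
}
```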
Upvotes: 0