Jay

Reputation: 309

Regular Expressions - Where Angels Fear to Tread

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.

What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.

My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.

For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?

The entire web page is contained within a single text string and the filtered result should also be a single string of text.

I'm not sure, but I think the code to do this could have a format similar to:

$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);

The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".

Regular expressions are obviously the work of Satan!

Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.

Thanks, Jay

Upvotes: 0

Views: 367

Answers (3)

tobyodavies

Reputation: 28099

This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.

On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or at phpQuery and QueryPath.
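As a rough illustration of the parser-based route, here is a minimal sketch using PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`). The helper name `replaceNthElement` and its parameters are made up for this example, and it assumes the replacement should be inserted as plain text rather than as markup.

```php
<?php
// Hypothetical helper: replace the contents of the nth <tag> element
// (1-based, counted in document order, nested elements included).
function replaceNthElement(string $html, string $tag, int $n, string $replacement): string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The HTML parser lowercases element names, so normalise the tag too.
    // "(//div)[5]" selects the 5th div in the whole document.
    $nodes = $xpath->query('(//' . strtolower($tag) . ')[' . $n . ']');
    if ($nodes === false || $nodes->length === 0) {
        return $html; // fewer than $n such elements: leave the page unchanged
    }

    $node = $nodes->item(0);
    // Drop every child of the target element...
    while ($node->firstChild !== null) {
        $node->removeChild($node->firstChild);
    }
    // ...and put the replacement text in its place.
    $node->appendChild($doc->createTextNode($replacement));

    return $doc->saveHTML();
}
```

Calling `replaceNthElement($OriginalPage, 'div', 5, 'xxxxxxx')` would blank out the 5th `<div>`. Note that `saveHTML()` serialises the whole parsed document, so the result may gain a doctype and `<html>`/`<body>` wrappers that the input string did not have.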

The reason god kills kittens when you parse HTML with a regex:

HTML is not a regular language, so a regex can only ever parse a very limited subset of it. HTML elements can nest to arbitrary depth, and handling that properly takes at least a context-free parser.

Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!

RegEx match open tags except XHTML self-contained tags

Upvotes: 3

Jason Williams

Reputation: 57902

Firstly, are you using the right tool for the job? Regex is a text-matching engine, not a full-blown parser; perhaps a dedicated HTML parser will give better results.

Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:

Start with a simple block of normal English text, and try to match and replace (for example) every occurrence of the word "and".

When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?

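Then modify your regex to match your 5th HTML tag (which is a bit harder: < and > are not special characters in a regex, but the / in a closing tag needs escaping if / is also your pattern delimiter, and real tags often carry attributes).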

By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
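The first two steps might look something like the sketch below. It lets `preg_replace_callback` find every match while plain PHP does the counting. The helper name `replaceNthMatch` is invented for this example.

```php
<?php
// Hypothetical helper: replace only the nth match (1-based) of $pattern.
function replaceNthMatch(string $pattern, string $replacement, string $text, int $n): string
{
    $count = 0;
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $n, $replacement): string {
            $count++;
            // Hand back the original match unless this is the nth one.
            return $count === $n ? $replacement : $m[0];
        },
        $text
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

$text = 'salt and pepper and oil and vinegar';
echo replaceNthMatch('/\band\b/', 'AND', $text, 2), "\n";
// prints: salt and pepper AND oil and vinegar
```

Keeping the counter in PHP means the pattern itself stays simple, so swapping in a different pattern later does not change the counting logic.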

Upvotes: 3

David Conde

Reputation: 4637

OK, a few ground rules.

  • Don't post a question like that; wrapping the whole question in a preformatted block will only keep people away
  • Regular expressions are awesome!
  • If you want to consider other options, look into how to read HTML as an XML document and query it using XPath
  • @tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway

Now, to your problem. With this one:

$regex = "#<div>(.+?)</div>#si";

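You should be OK using that expression and counting the occurrences, much like this (note that it only matches a bare <div> with no attributes, and it will not cope with nested divs):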

preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER );

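Suppose you only need the 5th one. $matches[$i][0] is the whole string of the match at index $i, and indexes start at 0, so the 5th match is at index 4:

if (count($matches) >= 5)
{
   // Index 4 is the 5th match because PREG_SET_ORDER results are 0-based.
   $myMatch = $matches[4][0];      // the full match, including the <div> tags
   $matchedText = $matches[4][1];  // only the text between the tags
}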
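To actually replace the matched text rather than just read it, one option is to ask for byte offsets as well and splice the replacement in. This is only a sketch built on the $regex above, with the same limits (no attributes, no nested divs). The helper name `replaceNthDivBody` is hypothetical.

```php
<?php
// Hypothetical helper: replace the text between the nth <div>...</div>
// pair (1-based) found by the simple pattern from the answer above.
function replaceNthDivBody(string $html, int $n, string $replacement): string
{
    $regex = '#<div>(.+?)</div>#si';

    // PREG_OFFSET_CAPTURE turns every captured string into [text, byte offset].
    preg_match_all($regex, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if (count($matches) < $n) {
        return $html; // not enough matches: return the page untouched
    }

    // Group 1 of the nth match is the text between the tags.
    [$body, $offset] = $matches[$n - 1][1];

    // Splice the replacement over exactly that byte range.
    return substr_replace($html, $replacement, $offset, strlen($body));
}
```

For the original question that would be `$FilteredPage = replaceNthDivBody($OriginalPage, 5, 'xxxxxxxx');`.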

Good luck in your efforts...

Upvotes: 0
