Reputation: 21197
I am having difficulty writing regular expressions when there are whitespace characters and carriage returns between the pieces of text I want to match.
For example, in the case below, how can I write a regular expression that captures "<div id="contentleft">"?
<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript>
I tried
id="content">(.*?)<SCRIPT
but it doesn't work.
Upvotes: 1
Views: 258
Reputation: 1171
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHTML($html_str);
$xpath = new DOMXPath($dom);
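$div = $xpath->query('//div[@id="content"]')->item(0);
The leading // makes the query search the whole document. Without it, the expression is evaluated relative to the document node, which has no div children, so nothing would be found.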
Upvotes: 0
Reputation: 10872
Take a look into the PCRE modifiers: https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
You can apply the s modifier, like '/id="content">(.*?)<SCRIPT/s'
(Unlike the m modifier, s does not change the way ^ and $ work; it only affects the dot.)
Otherwise, you can do '/id="content">((.|\n)*?)<SCRIPT/'
EDIT: oops, wrong modifier...
Upvotes: 1
Reputation: 164769
$s = '<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript>';
if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
print $matches[1]."\n";
Dot, by default, matches everything but newlines. /s makes it match everything.
But really, use a DOM parser. You can walk the tree or you can use an XPath query. Think of it like regexes for XML.
$s = '<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript>';
// Load the HTML
$doc = new DOMDocument();
$doc->loadHTML($s);
// Use XPath to find the <div id="content"> tag's descendants.
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//div[@id='content']/descendant::*");
foreach( $entries as $node ) {
// Stop when we see <script ...>
if( $node->nodeName == "script" )
break;
// do what you want with the content
}
XPath is extremely powerful. Here's some examples.
PS I'm sure (I hope) the above code can be tightened up some.
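One way the DOM version might be tightened is to let XPath select the wanted element directly instead of walking the descendants. The following is only a sketch: it assumes the element you want is the first div child of the content div, and it uses libxml_use_internal_errors() to silence the warnings that the incomplete fragment would otherwise raise.

```php
<?php
$s = '<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript>';

// The fragment is not well-formed, so collect libxml's parse warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($s);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the div children of <div id="content"> and take the first one.
$first = $xpath->query("//div[@id='content']/div")->item(0);

if ($first !== null) {
    print $first->getAttribute('id') . "\n";
}
```

If the real page nests things differently, the path after //div[@id='content'] would need adjusting.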
Upvotes: 2
Reputation: 16649
Well, it is a multi-line issue, so take a look at the pattern modifiers:
m (PCRE_MULTILINE) By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl.
When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.
s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
from http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
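To make the difference between the two modifiers concrete, here is a small sketch that runs the pattern from the question three ways against the sample markup. The variable names are only illustrative.

```php
<?php
// Sample subject from the question: the two tags are separated by a newline.
$html = "<div id=\"content\">\n<div id=\"contentleft\"> <SCRIPT language=JavaScript>";

$pattern = 'id="content">(.*?)<SCRIPT';

// No modifier: the dot stops at the newline, so nothing matches.
$plain = preg_match('/' . $pattern . '/', $html);

// m only changes ^ and $, which this pattern does not use, so still no match.
$multiline = preg_match('/' . $pattern . '/m', $html);

// s lets the dot cross the newline, so the lazy group can reach <SCRIPT.
$dotall = preg_match('/' . $pattern . '/s', $html, $matches);

var_dump($plain, $multiline, $dotall);

// The capture still carries the surrounding whitespace, so trim it.
echo trim($matches[1]), "\n";
```

Only the s variant matches here, which is why m on its own does not help with this particular pattern.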
Upvotes: 0
Reputation: 655229
Another solution without regular expressions:
$start = 'id="content">';
$end = '<SCRIPT';
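if (($startPos = strpos($str, $start)) !== false &&
    ($endPos = strpos($str, $end, $startPos + strlen($start))) !== false) {
    // Skip past the start marker so only the text in between is returned
    $startPos += strlen($start);
    $substr = substr($str, $startPos, $endPos - $startPos);
}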
Upvotes: 0
Reputation: 338178
Try
id="content">((?:.|\n)*?)<SCRIPT
The usual warning not to parse HTML with regex applies, but you seem to know that already.
Alternatively:
(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)
The dot does not match newline characters by default. One way to get around that is to explicitly allow them. This would work even if the regex flavor you happen to use did not support a "dotall" modifier.
The first regex is the same as your approach, extended to allow \n. Your match would be in group 1; you only need to trim it.
The second regex uses zero-width assertions (look-ahead/look-behind) to mark the beginning and the end of the match. The match would not contain anything you don't want, so no grouping is necessary.
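As a rough illustration, both patterns can be run with preg_match against the sample from the question. The variable names here are just for the example.

```php
<?php
$s = "<div id=\"content\">\n<div id=\"contentleft\"> <SCRIPT language=JavaScript>";

// Capturing-group version: the text between the markers lands in group 1.
preg_match('/id="content">((?:.|\n)*?)<SCRIPT/', $s, $m1);
$fromGroup = trim($m1[1]);

// Lookaround version: the markers are asserted but not consumed,
// so the whole match ($m2[0]) is already the text in between.
preg_match('/(?<=id="content">)(?:.|\n)*?(?=<SCRIPT)/', $s, $m2);
$fromLookaround = trim($m2[0]);

var_dump($fromGroup === $fromLookaround);
echo $fromLookaround, "\n";
```

Both variants yield the same text once trimmed; the lookaround form simply saves you from indexing into a capture group.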
Upvotes: 0