P.K.

Reputation: 825

Incorrect regex for divs

I'm trying to extract the divs from many of my website files using regexes, but I'm failing.
This is what I'm trying to do: http://regexr.com/38to9

I need the regex to match the following div with class data, and also divs with classes plainText and extData, including everything inside them. There are no extra divs inside the ones I listed.
I've been sitting on this for around 2 hours now and I can't figure it out.
Here it is for anyone who doesn't want to go visit that cool site:

<div class="data">
    Something
</div>

<div class="data">
     Text in here
    <a class="data" href="links"><img src="whatever.png"></a>
</div>

With regex

\s*<div class="(data|plainText|extData)">\s*(...)\s*<\/div>

The first div is highlighted, the second one isn't. Nor do I get any results with preg_match_all in PHP. Does it have anything to do with the fact that I'm using tabs in the second div and not in the first one?
(Wrote it quickly on the website to see if it works)

Upvotes: 0

Views: 36

Answers (2)

zx81

Reputation: 41848

You have a great non-regex answer, but you should also know that you were really close...

With all the usual disclaimers about parsing HTML with regex, adding the DOTALL modifier (?s) to your original expression matches what you want:

(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*<\/div>

See demo.

How does this work?

The DOTALL modifier (?s) tells the engine that a dot can match a newline character. This is important for your (.*?) because the content of the divs can span several lines.
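
To see the difference in PHP, here is a minimal sketch that runs the same pattern with and without (?s) against the markup from the question. The variable names are illustrative only, and I've used ~ as the delimiter so the slash in </div> needs no escaping. Without the flag, only the first div should match, because the content of the second one spans a newline.

```php
<?php
// Sample markup copied from the question; variable names are illustrative.
$html = <<<'HTML'
<div class="data">
    Something
</div>

<div class="data">
     Text in here
    <a class="data" href="links"><img src="whatever.png"></a>
</div>
HTML;

// The same pattern, once without and once with the inline DOTALL flag.
$withoutDotall = '~<div class="(data|plainText|extData)">\s*(.*?)\s*</div>~';
$withDotall    = '~(?s)<div class="(data|plainText|extData)">\s*(.*?)\s*</div>~';

$countWithout = preg_match_all($withoutDotall, $html, $m1, PREG_SET_ORDER);
$countWith    = preg_match_all($withDotall, $html, $m2, PREG_SET_ORDER);

echo "without (?s): $countWithout match(es)\n";
echo "with (?s): $countWith match(es)\n";

foreach ($m2 as $match) {
    // $match[1] is the class name, $match[2] the inner content
    // with leading and trailing whitespace left out of the group.
    echo $match[1], ': ', $match[2], "\n---\n";
}
```

The trailing s modifier (as in ~...~s) should have the same effect as the inline (?s) if you prefer to keep the flag outside the pattern.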

Upvotes: 1

Niet the Dark Absol

Reputation: 324820

Have you tried using a parser instead?

$dom = new DOMDocument();
$dom->loadHTML($input);
$divs = $dom->getElementsByTagName('div');
foreach($divs as $div) {
  if( preg_match("/\b(data|plainText|extData)\b/",$div->getAttribute("class")) ) {
    // do something to the $div
    $div->setAttribute("title","I matched!");
  }
}
$out = $dom->saveHTML();

// Because DOMDocument wraps our HTML in a minimal document, we need to extract
// the body contents. In this case, regex is okay because we have a known structure.
// The s modifier lets . match the newlines in the serialised output:
$out = preg_replace("~.*?<body>(.*)</body>.*~s","$1",$out);
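
If the goal is to grab everything inside the matching divs, as the question asks, rather than to modify them, the same parser approach can collect each div's inner HTML. This is only a sketch: innerHtmlOfDivs is a hypothetical helper name, not something from the code above, and it assumes PHP 7 or later and that the input is a non-empty HTML fragment.

```php
<?php
// Hypothetical helper (not part of the answer above): returns the inner HTML
// of every <div> whose class list contains one of the wanted class names.
function innerHtmlOfDivs(string $html, array $wanted): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($dom->getElementsByTagName('div') as $div) {
        // Split the class attribute on whitespace and compare whole names.
        $classes = preg_split('/\s+/', trim($div->getAttribute('class')));
        if (!array_intersect($classes, $wanted)) {
            continue;
        }
        $inner = '';
        foreach ($div->childNodes as $child) {
            // saveHTML($node) serialises only that node, not the whole document.
            $inner .= $dom->saveHTML($child);
        }
        $results[] = trim($inner);
    }
    return $results;
}

// Usage with the markup from the question.
$html = '<div class="data">
    Something
</div>

<div class="data">
     Text in here
    <a class="data" href="links"><img src="whatever.png"></a>
</div>';

$found = innerHtmlOfDivs($html, ['data', 'plainText', 'extData']);
foreach ($found as $content) {
    echo $content, "\n---\n";
}
```

Comparing whole class names avoids the partial matches that \b can let through, since \b treats a hyphen as a word boundary and would also accept a class such as data-old.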

Upvotes: 2
