Eamonn

Reputation: 418

Extract multiple <li> items from HTML with regex

I'm trying to extract the text within the <li> </li> tags below. My regex works, but it only returns the first <li> (Lorem ipsum...).

I'm reasonably new to regex, and I am aware it would likely be more reliable to do this by traversing the DOM, but in this case regex is preferred. Any ideas what I need to change to get all the results, instead of just the one?

/<div class="foo-bar">[\s\S]+<ul>[\s\S]*?(<li>([\s\S]*?)<\/li>)+[\s\S]*?<\/ul>/

<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
        <li>
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        </li>
        <li>
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        </li>
    </ul>
    <!-- Other junk -->
</div>

Upvotes: 1

Views: 60

Answers (3)

ThW

Reputation: 19502

Use DOM + XPath, not regex.

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXpath($document);

foreach($xpath->evaluate('//div[@class="foo-bar"]/ul/li') as $li) {
  var_dump($li->textContent);
}

Output:

string(80) "
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        "
string(75) "
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        "
string(95) "
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        "
string(89) "
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        "
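The dumped strings above keep the indentation and line breaks around each item. A minimal sketch of the same DOM + XPath approach that trims each value and collects the results into an array is below. The shortened `$html` sample, the `$items` name, and the use of `libxml_use_internal_errors()` to silence parser warnings on imperfect markup are assumptions added for illustration, not part of the original answer.

```php
<?php
// Shortened sample of the markup from the question (assumed input).
$html = '<div class="foo-bar"><!-- Other junk --><ul>'
      . "<li>\n    Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n</li>"
      . "<li>\n    Vestibulum iaculis nibh ac orci imperdiet ultrices.\n</li>"
      . '</ul><!-- Other junk --></div>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// Gather the trimmed text of every <li> under the target <div>.
$items = [];
foreach ($xpath->query('//div[@class="foo-bar"]/ul/li') as $li) {
    $items[] = trim($li->textContent);
}

print_r($items);
```

Note that `@class="foo-bar"` only matches when the attribute is exactly that string; a `<div>` carrying additional classes would need a different expression.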

Upvotes: 1

funilrys

Reputation: 815

It is better to do this in two steps with preg_match_all(). I tested it and it works.

First, run preg_match_all() with the following pattern to isolate the matching block:

/<div class="foo-bar">([\s\S]*?)<ul>([\s\S]*?)<\/ul>([\s\S]*?)<\/div>/

Then run preg_match_all() on the result of the previous step with the following pattern to get only the <li> contents:

/<li>([\s\S]*?)<\/li>/
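The two steps can be sketched as follows. This is a rough illustration rather than a definitive implementation: the shortened `$html` sample and the variable names (`$blocks`, `$ulBody`, `$items`) are assumptions, and the first pattern is simplified to capture only the `<ul>` body. The reason a single pattern returns one item is that a repeated capture group such as `(<li>...<\/li>)+` keeps only one value per match, so the items have to be collected by a separate preg_match_all() call.

```php
<?php
// Shortened sample of the markup from the question (assumed input).
$html = <<<'HTML'
<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
    </ul>
    <!-- Other junk -->
</div>
HTML;

$items = [];

// Step 1: capture the body of each <ul> that follows the target <div>.
preg_match_all('/<div class="foo-bar">[\s\S]*?<ul>([\s\S]*?)<\/ul>/', $html, $blocks);

foreach ($blocks[1] as $ulBody) {
    // Step 2: capture the body of every <li> inside that fragment.
    preg_match_all('/<li>([\s\S]*?)<\/li>/', $ulBody, $matches);
    foreach ($matches[1] as $text) {
        $items[] = trim($text);
    }
}

print_r($items);
```

This assumes the `<li>` elements carry no attributes and contain no nested lists; either case would need a looser pattern or a DOM parser.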

Upvotes: 0

Andy

Reputation: 698

Add the global g flag at the end. For example:

/<div class="foo-bar">[\s\S]+<ul>[\s\S]*?(<li>([\s\S]*?)<\/li>)+[\s\S]*?<\/ul>/g

You may also want the i flag for case-insensitive matching.

Upvotes: 0
