ganlub

Reputation: 149

Regex with "&nbsp;" glued to content

I'm having trouble with a regular expression while trying to capture some data from this HTML:

<ul>
       <li>Nombre de mots à traduire&nbsp;:&nbsp;41 mots.</li>
       <li>Nombre de mots partiellement traduits&nbsp;:&nbsp;164 mots.</li>
       <li>Nombre de mots traduits&nbsp;:&nbsp;792 mots.</li>
       <li>Nombre de correspondances exactes&nbsp;:&nbsp;808 mots.</li>
       <li>Nombre de répétitions internes&nbsp;:&nbsp;71 mots.</li>
       <li>Total&nbsp;:&nbsp;1876 mots.</li>
</ul>

I need to get the quantity of 'mots' for every <li> with a PHP regex, but the &nbsp;:&nbsp; is glued to the number and I can't capture it.

For the first one I'm trying (?<=\btraduire&nbsp;:&nbsp;\s)(\w+), but it doesn't work... I can't modify the HTML in any way, and I can't use html_entity_decode().

This HTML changes dynamically: the length of these numbers will vary, so this is just one example.

Any thoughts?

EDIT: Okay, with (\d+)\smots I can get it!! =D But what if I have this:

<p>
    Langue source&nbsp;:&nbsp;FRA (FRA)<br/>
    Langue cible&nbsp;:&nbsp;ESP (ESP)
</p>

And I want to get "FRA (FRA)" and "ESP (ESP)". Any ideas?

Upvotes: 0

Views: 226

Answers (2)

Mike Dinescu

Reputation: 55720

If you need the quantity of mots for each <li>, a regex like this should do:

(\d+)\smots

Note, however, that if you're trying to parse HTML you're probably better off using an HTML parser, as regular expressions have a hard time with non-regular syntax (e.g. HTML, XML).
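To illustrate the parser route, here is a minimal sketch using PHP's bundled DOM extension. It assumes that extension is available, and the function name extractWordCounts is made up for this example. Keep in mind that the parser decodes &nbsp; on its own, which may or may not be acceptable given the constraint about html_entity_decode() in the question.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<ul>
       <li>Nombre de mots à traduire&nbsp;:&nbsp;41 mots.</li>
       <li>Nombre de mots partiellement traduits&nbsp;:&nbsp;164 mots.</li>
       <li>Nombre de mots traduits&nbsp;:&nbsp;792 mots.</li>
       <li>Nombre de correspondances exactes&nbsp;:&nbsp;808 mots.</li>
       <li>Nombre de répétitions internes&nbsp;:&nbsp;71 mots.</li>
       <li>Total&nbsp;:&nbsp;1876 mots.</li>
</ul>
HTML;

// Hypothetical helper: returns the word count found in each <li>, in document order.
function extractWordCounts(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $counts = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent is the item's text with entities already decoded,
        // so only the digits in front of "mots" need to be picked out.
        // Only the digits are read here; the accented label text is ignored.
        if (preg_match('~(\d+)\s+mots~', $li->textContent, $m)) {
            $counts[] = (int) $m[1];
        }
    }

    return $counts;
}

print_r(extractWordCounts($html));
```

The regex is still there, but it only has to deal with the text of a single list item rather than with the surrounding markup.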

UPDATE

For your second query, I would try something like this:

Langue.*([A-Z]{3})\s\(\1\)

In the above, the first capture group holds the three-letter language code. The \1 in the last part of the regex is a backreference to that first capture group, which means that FRA (FRA) would match, but FRA (BLA) would not.
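A quick way to check the backreference behaviour is to run the pattern against the snippet from the question. This is only a sketch; the variable names are chosen for the example.

```php
<?php
// Sample markup copied from the question's edit.
$html = <<<'HTML'
<p>
    Langue source&nbsp;:&nbsp;FRA (FRA)<br/>
    Langue cible&nbsp;:&nbsp;ESP (ESP)
</p>
HTML;

// Group 1 captures the three-letter code; \1 demands the same code inside the parentheses.
$pattern = '~Langue.*([A-Z]{3})\s\(\1\)~';

// Without the "s" flag, "." does not cross newlines, so each line is matched separately.
preg_match_all($pattern, $html, $matches);
$codes = $matches[1];
print_r($codes); // FRA and ESP

// A code that differs inside the parentheses does not satisfy \1.
$mismatch = preg_match($pattern, 'Langue source&nbsp;:&nbsp;FRA (BLA)');
var_dump($mismatch); // int(0)
```

Note that group 1 contains only the code itself; $matches[0] holds the whole matched text starting at "Langue".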

Upvotes: 1

Casimir et Hippolyte

Reputation: 89557

You can use this:

preg_match_all('~[0-9]+(?= mots\.</li>)~', $html, $matches);
print_r($matches);

or, more explicitly:

preg_match_all('~(?J)<li>(?:Nombre de (?<what>[^&]++)|(?<what>Total))[^0-9]+(?<quantity>[0-9]+)[^<]*</li>~i', $html, $matches, PREG_SET_ORDER);
print_r($matches); 

For your edit:

preg_match_all('~Langue (?<target>[^&\s]++)&nbsp;:&nbsp;\s*(?<language>[^\r\n<]+)\s*~i', $html, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    printf("\n%s\t%s", $match['target'], $match['language']);
}

Upvotes: 1
