Eoin

Reputation: 1493

PHP How to remove certain attributes from a body of text

I have a variable, $text, which contains a load of HTML. Most of it is not useful for my purposes, but some of it is.

HTML that comes out:

<div class="feed-item-description">
<ul>
<li><strong>Impact:</strong>&nbsp;Low</li>
<li><strong>Severity:</strong> <span class="label label-info">Low</span></li>
</ul>
...

What I'd like to do

I'd like to get the impact and the severity rating out of this text. I don't need the label.

I have tried doing this:

$itemAttributes = explode (':' , $text);

$impact     = $itemAttributes[3];
$severity   = $itemAttributes[4];

This does seem to give me the attributes I want, but each value also drags along the word that follows it. It also behaves strangely: even if I trim() it, I cannot get rid of the leading space in my output.

It also seems to close a <div> behind it, which I can't explain. I'm sure I'm about to get shouted down about using Regex for HTML, but I figured there must be a way to get something so simple out as it's the same words each time preceding the information I want.

If you want to see the actual output on a page, you can see it here: https://dev.joomlalondon.co.uk/. In the output I generate, it closes the <div class="feed-item-description">, but I don't tell it to do that anywhere, and the output I use is contained within an <li>, not a <div>.

Upvotes: 0

Views: 103

Answers (2)

Nick

Reputation: 147146

Because you should really use DOMDocument to parse HTML, here's a solution using it:

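// $html holds the markup to parse ($text in the question)
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$feed_items = $xpath->query('//div[contains(@class, "feed-item-description")]');
foreach ($feed_items as $feed_item) {
    // The leading "." keeps each query relative to the current feed item;
    // a bare "//li" would search the whole document on every iteration.
    $impact_node = $xpath->query('.//li[contains(string(), "Impact:")]', $feed_item);
    // Strip the label and any non-word characters after it (including the decoded &nbsp;)
    $impact = preg_replace('/Impact:\W*/u', '', $impact_node->item(0)->textContent);
    echo $impact . "\n";
    $severity_node = $xpath->query('.//li[contains(string(), "Severity:")]', $feed_item);
    $severity = preg_replace('/Severity:\W*/u', '', $severity_node->item(0)->textContent);
    echo $severity . "\n";
}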

Output (for your sample HTML)

Low
Low

Demo on 3v4l.org
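
The same idea can be wrapped in a small function. This is only a sketch: the name extractRatings and the closed-off sample markup are assumptions for illustration, and it reads the first matching <li> in the whole document instead of looping over feed items. It also strips the non-breaking space that &nbsp; decodes to, which trim() does not remove by default and which is probably why trimming appeared to have no effect.

```php
<?php
// Hypothetical helper (not part of the answer above): returns the ratings
// found in the first matching <li> elements of the given HTML.
function extractRatings(string $html): array
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of emitting them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $ratings = [];
    foreach (['Impact', 'Severity'] as $label) {
        $nodes = $xpath->query('//li[contains(string(), "' . $label . ':")]');
        if ($nodes->length === 0) {
            continue;                    // label not present in this markup
        }
        // textContent drops the tags; \x{00A0} covers the decoded &nbsp;
        $pattern = '/^.*?' . $label . ':[\s\x{00A0}]*/su';
        $value = preg_replace($pattern, '', $nodes->item(0)->textContent);
        $ratings[strtolower($label)] = trim($value);
    }
    return $ratings;
}

// Sample markup from the question, with the <div> closed for the example.
$sample = <<<'HTML'
<div class="feed-item-description">
<ul>
<li><strong>Impact:</strong>&nbsp;Low</li>
<li><strong>Severity:</strong> <span class="label label-info">Low</span></li>
</ul>
</div>
HTML;

print_r(extractRatings($sample));
```

For the sample above this should give an array with 'impact' and 'severity' both set to Low, without the surrounding tags or the label markup.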

Upvotes: 0

Emma

Reputation: 27723

Maybe,

^\h*(Impact:)\s+(.*)|^\h+(Severity:)\s+(.*)

would return the desired values, provided the input is plain text with one field per line, as in the test below.

Test

$re = '/^\h*(Impact:)\s+(.*)|^\h+(Severity:)\s+(.*)/m';
$str = 'Project: Joomla!
    SubProject: CMS
    Impact: Low
    Severity: Low
    Versions: 3.6.0 - 3.9.12
    Exploit type: Path Disclosure
    Reported Date: 2019-November-01
    Fixed Date: 2019-November-05
    CVE Number: CVE-2019-18674

Description

Missing access check in the phputf8 mapping files could lead to an path disclosure.
Affected Installs

Joomla! CMS versions 3.6.0 - 3.9.12';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output

array(2) {
  [0]=>
  array(3) {
    [0]=>
    string(15) "    Impact: Low"
    [1]=>
    string(7) "Impact:"
    [2]=>
    string(3) "Low"
  }
  [1]=>
  array(5) {
    [0]=>
    string(17) "    Severity: Low"
    [1]=>
    string(0) ""
    [2]=>
    string(0) ""
    [3]=>
    string(9) "Severity:"
    [4]=>
    string(3) "Low"
  }
}
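
The test string above is plain text, whereas the question's $text holds HTML. A minimal sketch of one way to bridge the two, assuming the markup looks like the question's snippet (closed off here for the example): strip the tags, decode the entities, then match with a simplified pattern. The named groups and the $ratings array are illustrative choices, not part of the expression above.

```php
<?php
// Sample markup from the question, with the <div> closed for the example.
$text = '<div class="feed-item-description">
<ul>
<li><strong>Impact:</strong>&nbsp;Low</li>
<li><strong>Severity:</strong> <span class="label label-info">Low</span></li>
</ul>
</div>';

// Drop the tags, then turn entities such as &nbsp; into real characters.
$plain = html_entity_decode(strip_tags($text), ENT_QUOTES | ENT_HTML5, 'UTF-8');

// One pattern with named groups. [\h\x{00A0}]* also swallows the
// non-breaking space, which trim() leaves behind by default.
$re = '/^\h*(?<label>Impact|Severity):[\h\x{00A0}]*(?<value>.*\S)/mu';
preg_match_all($re, $plain, $matches, PREG_SET_ORDER);

// Collect the values keyed by lower-cased label.
$ratings = [];
foreach ($matches as $m) {
    $ratings[strtolower($m['label'])] = $m['value'];
}

print_r($ratings);
```

This only holds up while each field stays on its own line after the tags are removed; for anything less regular, parsing the HTML is the safer route.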

If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can watch the matching steps or modify them in this debugger link if you are interested. The debugger demonstrates how a regex engine might consume sample input strings step by step and perform the matching.


RegEx Circuit

jex.im visualizes regular expressions:

[Image: jex.im diagram of the expression above]

Upvotes: 1
