James Taylor

Reputation: 6258

Build Stripped HTML Array from String in PHP

I have a string that looks something like this:

$html_string = "<p>Some content</p><p>separated by</p><p>paragraphs</p>";

I'd like to do some parsing on the content inside the tags, so I think creating an array from it would be easiest. Currently I'm using a series of explode and implode calls to achieve what I want:

$stripped = explode('<p>', $html_string);
$joined = implode(' ', $stripped);
$parsed = explode('</p>', $joined);

which in effect gives:

array('Some content', 'separated by', 'paragraphs'); 

Is there a better, more robust way to create an array from HTML tags? Looking at the docs, I didn't see any mention of parsing via a regular expression.

Thanks for your help!

Upvotes: 2

Views: 74

Answers (3)

trincot

Reputation: 350232

Here is a DOMDocument solution (native PHP). It also works when your p tags have attributes, contain other tags like <br>, are separated by lots of white space (which is irrelevant in HTML rendering), or contain HTML entities like &nbsp; or &lt;:

$html_string = "<p>Some content</p><p>separated by</p><p>paragraphs</p>";
$doc = new DOMDocument();
$doc->loadHTML($html_string);

$paras = []; // initialise so the result is defined even when no <p> is found
foreach ($doc->getElementsByTagName('p') as $p) {
    $paras[] = $p->textContent;
}

// Output array:
print_r($paras);
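
As a rough check of those claims, here is a sketch with a made-up input that mixes an attribute, an inline <br>, white space between the tags and a few entities. The input string and the trim() call are my own additions for illustration; libxml_use_internal_errors() is only there to keep parser warnings quiet on sloppier markup.

```php
<?php
// Hypothetical input: attribute, <br>, white space between tags, entities.
$html_string = "<p class='intro'>Some&nbsp;content</p>\n  "
             . "<p>separated<br>by</p><p>a &lt;b&gt; tag</p>";

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html_string);
libxml_clear_errors();

$paras = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    // textContent drops nested tags and decodes entities.
    $paras[] = trim($p->textContent);
}

print_r($paras);
```

Note that <br> contributes no text, so the second entry comes out as "separatedby". If you need a space there, you would have to walk the child nodes yourself.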

If you really want to stick with regular expressions, then at least allow tag attributes and HTML entities, translating the latter to their corresponding characters:

$html_string = "<p>Some content &amp; text</p><p>separated&nbsp;by</p><p style='background:yellow'>paragraphs</p>";

preg_match_all('/<p(?:\s.*?)?>\s*(.*?)\s*<\/p\s*>/si', $html_string, $matches);

$paras = $matches[1];
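// array_walk() ignores the callback's return value, so use array_map() to actually decode the entities
$paras = array_map('html_entity_decode', $paras);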

print_r($paras);

Upvotes: 0

Manuel Mannhardt

Reputation: 2201

If it's only that simple, with few or no other tags inside the content, you can simply use a regex for that:

$string = '<p>Some content</p><p>separated by</p><p>paragraphs</p>';

preg_match_all('/<p>([^<]*?)<\/p>/mi', $string, $matches);

var_dump($matches[1]);

which creates this output:

array(3) {
  [0]=>
  string(12) "Some content"
  [1]=>
  string(12) "separated by"
  [2]=>
  string(10) "paragraphs"
}

Keep in mind that this is not the most effective way, nor is it the fastest, but it's shorter than using DOMDocument or anything like that.
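
To give an idea of where this pattern stops working, here is a small sketch with a made-up input string (not from the question). Paragraphs that carry attributes or contain nested tags are silently skipped:

```php
<?php
// Hypothetical input: only the first paragraph is a bare <p>...</p> without inner tags.
$string = '<p>plain</p><p class="x">with attribute</p><p>with <b>bold</b> text</p>';

preg_match_all('/<p>([^<]*?)<\/p>/mi', $string, $matches);

// Only "plain" is captured: '<p class="x">' does not match the literal '<p>',
// and '[^<]*?' cannot cross the nested '<b>' tag.
var_dump($matches[1]);
```

So if your content can look like that, one of the parser-based approaches is probably the safer choice.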

Upvotes: 1

BarakD

Reputation: 548

If you need to do some HTML parsing in PHP, there is a nice library for that called PHP Html Parser (https://github.com/paquettg/php-html-parser), which gives you a jQuery-like API for parsing HTML.

An example:

// Assuming you installed from Composer:
require "vendor/autoload.php";
use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->load('<p>Some content</p><p>separated by</p><p>paragraphs</p>');
$pTags = $dom->find('p');
foreach ($pTags as $tag) {
    // do something with the html
    $content = $tag->innerHtml;
}

Upvotes: 0
