Reputation: 57244

DOM xpath to find #text nodes and wrap in paragraph tag

I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.

<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.

The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.

    <?php

$html = '<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.';

libxml_use_internal_errors(TRUE);

// loadHTML() is an instance method; calling it statically is an error in PHP 8.
$dom = new DOMDocument();
$dom->loadHTML($html);

$xp = new DOMXPath($dom);

$xpath = '//text()[not(parent::p) and normalize-space()]';

foreach($xp->query($xpath) as $node) {
    $element = $dom->createElement('p');
    $node->parentNode->replaceChild($element, $node);
    $element->appendChild($node);
}

print $dom->saveHTML();

Upvotes: 10

Answers (4)

pszaba

Reputation: 1064

I know it is not xpath but check this out:

PHP Simple HTML DOM Parser

http://simplehtmldom.sourceforge.net/

Features

A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way!

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.

Upvotes: 1

hakre

Reputation: 198123

Your scenario has many edge cases, and the word "should" adds to them. I assume you want the classic "a double line break starts a new paragraph" behaviour, but this time inside a parent <div> (or other block elements) as well.

I would let the HTML parser do most of the work, but I would still use text search and replace (alongside XPath). So what follows is a bit hackish, but I think it is pretty stable:

First of all, I would select all text nodes that are at the top level or are children of the said div.

(.|./div)/text()

This xpath is relative to an anchor element which is the <body> tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument.
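
To make the anchor concrete, here is a minimal sketch of how it could be obtained and used as the context node for the query. The sample markup is made up for illustration, and the exact tree you get depends on how your libxml version repairs the fragment, so treat this as an assumption rather than the original demo code.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML("<div>\n    In a div.\n</div>\n\nAt root level.\n\n<b>And</b> this.");

$xp = new DOMXPath($dom);

// The parser adds <html> and <body> around the fragment; <body> is the anchor.
$anchor = $dom->getElementsByTagName('body')->item(0);

// Text nodes that are direct children of <body> or of a <div> directly below it.
$result = $xp->query('(.|./div)/text()', $anchor);

foreach ($result as $node) {
    echo $node->parentNode->nodeName, ': ', json_encode($node->data), "\n";
}
```

Passing `$anchor` as the second argument to `query()` is what makes the leading `.` in the expression refer to `<body>`.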

If the text node is a child of a div, I would insert the starting paragraph mark at the very beginning.

Then, in any case, I would insert a break mark (here in the form of a comment) at each occurrence of the sequence that starts a new paragraph. That should be "\n\n" because of whitespace normalization; I might be wrong, and if it doesn't apply, you would need to normalize the whitespace upfront for this to work transparently.

/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);

foreach ($result as $i => $node)
{
    if ($node->parentNode->tagName == 'div')
    {
        $insertBreakMarkBefore($node, true);
    }

    while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
    {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}
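
The loop uses a few names whose definitions are skipped here. The following is only a guess at what they might look like, not the original code: it assumes the paragraph sequence is "\n\n", that the comment text is `break` (suggested by the class name in the output), and that the helper surrounds the comment with the paragraph sequence so that a text search for sequence, comment, sequence can find it later.

```php
<?php
// Assumed values; the answer only names these variables.
$paragraphSequence       = "\n\n";
$paragraphSequenceLength = strlen($paragraphSequence);
$paragraphComment        = 'break';

// Hypothetical helper: insert <!--break--> followed by the paragraph sequence
// before $node. With $atStart, the sequence is also inserted in front of the
// comment, because no preceding text ends with it at the start of a <div>.
$insertBreakMarkBefore = function (DOMNode $node, $atStart = false)
    use ($paragraphSequence, $paragraphComment) {
    $doc    = $node->ownerDocument;
    $parent = $node->parentNode;
    if ($atStart) {
        $parent->insertBefore($doc->createTextNode($paragraphSequence), $node);
    }
    $parent->insertBefore($doc->createComment($paragraphComment), $node);
    $parent->insertBefore($doc->createTextNode($paragraphSequence), $node);
};

// Small demonstration on a hand-built tree (no HTML parsing involved).
$dom  = new DOMDocument();
$root = $dom->appendChild($dom->createElement('div'));
$text = $root->appendChild($dom->createTextNode("one\n\ntwo"));

// Split right after the first paragraph sequence and mark the split point.
$pos  = strpos($text->data, $paragraphSequence);
$tail = $text->splitText($pos + $paragraphSequenceLength);
$insertBreakMarkBefore($tail);

echo $dom->saveXML($root), "\n";
```

Because the inserted sequence lives in its own text node, the `while` loop never finds it again in `$node` and so cannot loop forever.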

These inserted break marks are just there to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though that might be interesting). This basically works like I once outlined in some other answer, but I can't find the link any longer:

1. After modifying the DOM tree, get the inner HTML of the <body> again.
2. Replace the marks with "<p>" (here I add a class as well to make this visible).
3. Load the HTML fragment into the parser again to re-create the DOM with the proper <p>...</p> pairs.
4. Obtain the HTML again from the DOMDocument parser, which is now final.

These outlined steps in code (skipping some of the function definitions for a moment):

$needle  = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html    = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));

echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
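
The two closures called above are among the skipped definitions. A minimal sketch of what they could look like, assuming `$loadHTMLFragment` returns the `<body>` element of a freshly parsed document and `$innerHTML` concatenates the serialized child nodes (both are my assumptions, not the original helpers):

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical helper: parse an HTML fragment and return its <body> element.
$loadHTMLFragment = function ($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return $dom->getElementsByTagName('body')->item(0);
};

// Hypothetical helper: serialize the children of $element, i.e. its inner HTML.
$innerHTML = function (DOMNode $element) {
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML() with a node argument needs PHP 5.3.6 or later.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
};

$body = $loadHTMLFragment('<div>Some <b>bold</b> text</div>');
echo $innerHTML($body), "\n";
```

Round-tripping through these two helpers is what lets the parser close the unbalanced `<p>` tags for you.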

As this shows, double sequences are replaced with a single one. The one at the end probably needs to be deleted as well (if applicable; you could also trim whitespace here).

The final HTML output:

<div>
<p class="break">

    This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>

Some more post-processing for nicer output formatting can be useful, too. I think it is worth doing, as it will help you tweak the algorithm. (Looking at the full demo again, whitespace normalization probably does not apply there, so use with care.)

Upvotes: 2

CodeWizard

Reputation: 142342

You can do it with pure JavaScript if you wish:

var content = document.evaluate(
    '//text()',
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
);

for (var i = 0; i < content.snapshotLength; i++) {
    console.log(content.snapshotItem(i).textContent);
}

Upvotes: 1

nwellnhof

Reputation: 33658

OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:

//text()[not(parent::p) and normalize-space()]
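
A quick way to check what the expression selects is to run it against a small fragment and print the parent of each match. The markup below is invented for illustration, and the resulting tree can vary with the libxml version, so this is a sketch rather than a guaranteed result.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML("<div>\n    In a div.\n</div>\n\n<p>Already wrapped.</p>\n\nAt root level.");

$xp = new DOMXPath($dom);

// Non-blank text nodes whose parent is not a <p>.
$matches = $xp->query('//text()[not(parent::p) and normalize-space()]');

foreach ($matches as $node) {
    echo $node->parentNode->nodeName, ': ', trim($node->data), "\n";
}
```

Note that this also matches text inside inline elements such as `<b>`; if you only want root-level text or text directly inside a div, you would have to add a condition on the parent, for example `(parent::body or parent::div)`.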

Upvotes: 8
