Xeoncross
Xeoncross

Reputation: 57244

DOM xpath to find #text nodes and wrap in paragraph tag

I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.

<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.

The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.

    <?php

$html = '<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.';

libxml_use_internal_errors(TRUE);

$dom = DOMDocument::loadHTML($html);

$xp = new DOMXPath($dom);

$xpath = '//text()[not(parent::p) and normalize-space()]';

foreach($xp->query($xpath) as $node) {
    $element = $dom->createElement('p');
    $node->parentNode->replaceChild($element, $node);
    $element->appendChild($node);
}

print $dom->saveHTML();

Upvotes: 10

Views: 4741

Answers (4)

pszaba
pszaba

Reputation: 1064

I know it is not xpath but check this out:

PHP Simple HTML DOM Parser

http://simplehtmldom.sourceforge.net/

Features

A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way!

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.

Upvotes: 1

hakre
hakre

Reputation: 198123

Your scenario has many edge-cases and the word should is adding on top. I assume you want to do the classic a double break starts a new paragraph thingy, however this time within parent <div> (or certainly other block elements) as well.

I would let do the HTML parser most of the work but I still would work with text search and replace (next to xpath). So what you will see coming is a bit hackish but I think pretty stable:

First of all I would select all text-nodes that are of top-level or child of the said div.

(.|./div)/text()

This xpath is relative to an anchor element which is the <body> tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument.

If child of a div then I would insert the starting paragraph at the very beginning.

Then in any case I would insert a break-mark (here in form of a comment) at each occurrence of the sequence that starts a new paragraph (that should be "\n\n" because of whitespace normalization, I might be wrong and if it doesn't apply, you would need to do the whitespace-normalization upfront to have this working transparently).

/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);

foreach ($result as $i => $node)
{
    if ($node->parentNode->tagName == 'div')
    {
        $insertBreakMarkBefore($node, true);
    }

    while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
    {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}

These inserted break-marks are just there to be replaced with a HTML <p> tag. A HTML parser will turn those into adequate <p>...</p> pairs so I can spare myself writing that algorithm (even though, this might be interesting). This basically work like I once outlined in some other answer but I just don't find the link any longer:

  1. After the modification of the DOM tree, get the innter HTML of the <body> again.
  2. Replace the set marks with "<p>" (here I mark the class as well to make this visible)
  3. Load the HTML fragment into the parser again to re-create the DOM with the proper <p>...</p> pairs.
  4. Obtain the HTML again from the DOMDocument parser, which now is finally.

These outlined steps in code (skipping some of the function definitions for a moment):

$needle  = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html    = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));

echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));

As this shows, double sequences are replaced with a single one. Probably one at the end need to be deleted as well (if applicale, you could also trim whitespace here).

The final HTML output:

<div>
<p class="break">

    This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>

Some more post-production for nice output formatting can be useful, too. Actually I think it's worth to do as it will help you tweak the algorithm (Full Demo - just seeing, whitespace normalization probably does not apply there. so use with care).

Upvotes: 2

CodeWizard
CodeWizard

Reputation: 142342

you can do it with pure JavaScript if you wish:

var content = document.evaluate(
                                      '//text()', 
                                      document, 
                                      null, 
                                      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, 
                                      null );

for ( var i=0 ; i < content .snapshotLength; i++ ){
  console.log( content .snapshotItem(i).textContent );
}

Upvotes: 1

nwellnhof
nwellnhof

Reputation: 33658

OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:

//text()[not(parent::p) and normalize-space()]

Upvotes: 8

Related Questions