Peter

Reputation: 2702

How to strip an HTML element from a text file with PHP?

I am cleaning up a mess created by Adobe InDesign's ePub export feature.

MY GOAL: OPTION 1. I want to remove all span elements whose class attribute is CharOverride-7, keeping their text and leaving the other span elements untouched. OPTION 2. In some cases I want to replace span.CharOverride-7 with a different element, such as i.

Note: my current approach is a manual, time-consuming mass search and replace, but the input text file is inconsistent (extra spaces and other artifacts).

The input text contains hundreds of p paragraphs which look like this:

    <p class="2"><span class="CharOverride-7">A book title</span><a href="https://aaa.net"><span class="CharOverride-8">https://aaa.net</span></a><span class="CharOverride-7">.</span></p>

    <p class="2"><span class="CharOverride-7">Another book title</span><a href="https://aaa.net"><span class="CharOverride-8">https://aaa.net/</span></a><span class="CharOverride-7">.</span></p>

The desired output should look like this:

OPTION ONE (removal of the element)

    <p class="2">A book title<a href="https://aaa.net"><span class="CharOverride-8">https://aaa.net</span></a>.</p>

OPTION TWO (replace span.CharOverride-7 with an i element)

    <p class="2"><i>A book title</i><a href="https://aaa.net"><span class="CharOverride-8">https://aaa.net</span></a><i>.</i></p>

Upvotes: 0

Views: 103

Answers (2)

Marco

Reputation: 3641

For option one, this works using DOMDocument: https://www.php.net/manual/de/class.domdocument.php

<?php
$yourHTML = '<p class="2"><span class="CharOverride-7">A book title</span><a href="https://aaa.net"><span class="CharOverride-8">https://aaa.net</span></a><span class="CharOverride-7">.</span></p>';
$dom      = new DOMDocument();
$dom->loadHTML($yourHTML, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

// getElementsByTagName() returns a live list, so copy it to an array
// before replacing nodes; otherwise adjacent matches can be skipped.
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
    // getAttribute() returns '' when the span has no class attribute.
    if ($span->getAttribute('class') === 'CharOverride-7') {
        // Replace the span with a text node holding its text content.
        $newelement = $dom->createTextNode($span->textContent);
        $span->parentNode->replaceChild($newelement, $span);
    }
}

$ret = $dom->saveHTML();

// <p class="2">A book title<a href="https://aaa.net"><span class="CharOverride-8">https://aaa.net</span></a>.</p>
echo $ret;
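
For option two, the same DOM approach can be adapted. The following is only a sketch under a few assumptions: replaceSpansWithTag() is a hypothetical helper name (not part of PHP's DOM extension), the class attribute is matched exactly, and the input is wrapped in a temporary div so the parser sees a single root element even when the text holds several paragraphs.

```php
<?php
// Sketch only: replaceSpansWithTag() is a hypothetical helper, not a built-in.
function replaceSpansWithTag(string $html, string $class, string $tag): string
{
    $dom = new DOMDocument();
    // A temporary wrapper <div> gives the parser exactly one root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

    // Copy the live node list to an array before modifying the tree.
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        if ($span->getAttribute('class') !== $class) {
            continue;
        }
        $replacement = $dom->createElement($tag);
        // appendChild() moves each child node, so nested markup is kept.
        while ($span->firstChild !== null) {
            $replacement->appendChild($span->firstChild);
        }
        $span->parentNode->replaceChild($replacement, $span);
    }

    // Serialize only the wrapper's children, leaving the <div> out of the result.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<p class="2"><span class="CharOverride-7">A book title</span><a href="https://aaa.net"><span class="CharOverride-8">https://aaa.net</span></a><span class="CharOverride-7">.</span></p>';
echo replaceSpansWithTag($html, 'CharOverride-7', 'i'), PHP_EOL;
```

Because the children are moved rather than copied as text, any markup nested inside the span should survive the replacement, which createTextNode($span->textContent) would flatten.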

Upvotes: 3

Simon K

Reputation: 1523

Here's a simple approach for you using preg_replace()...

<?php

$data = file_get_contents('[YOUR FILENAME HERE]');

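// The U modifier makes (.*) ungreedy, so each match ends at the first closing </span>.
// Option one: keep only the text inside the span.
$result1 = preg_replace('/<span class="CharOverride-7">(.*)<\/span>/U', '$1', $data);
// Option two: uncomment to wrap the text in <i> instead.
//$result2 = preg_replace('/<span class="CharOverride-7">(.*)<\/span>/U', '<i>$1</i>', $data);

echo $result1;
// echo $result2;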

// Overwrite your file here... (Beyond scope of this question)

Use $result1 for option one, or uncomment the $result2 lines for option two.
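
Since the question mentions extra spaces in the input, here is a slightly more tolerant variant of the same idea. It is a sketch, not a general HTML parser: the sample string is made up for illustration, and the pattern assumes the span has only the class attribute and contains no nested span.

```php
<?php
// Hypothetical sample with irregular spacing inside the opening tag.
$data = '<p class="2"><span  class="CharOverride-7" >A book title</span><span class="CharOverride-8">x</span></p>';

// \s+ and \s* absorb stray whitespace in the tag; (.*?) is lazy,
// and the s modifier lets the dot match across line breaks.
$pattern = '/<span\s+class\s*=\s*"CharOverride-7"\s*>(.*?)<\/span>/s';

$removed  = preg_replace($pattern, '$1', $data);        // option one
$replaced = preg_replace($pattern, '<i>$1</i>', $data); // option two

echo $removed, PHP_EOL, $replaced, PHP_EOL;
```

If a CharOverride-7 span can contain another span, the lazy match would stop at the inner closing tag, so a DOM-based approach is likely the safer choice for that case.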


Upvotes: 0
