eylay

Reputation: 2160

Generate plain text using PHP

I'm using a service that returns a generated string. The strings usually look like this:

Hello   Mr   John Doe, you are now registered \t.
Hello &nbsp; Mr   John Doe, your phone number is &nbsp; 555-555-555 &nbsp; \n

I need to remove all HTML entities, as well as \t, \n and similar characters.

I can use html_entity_decode to deal with the non-breaking spaces and str_replace to remove \t or \n, but is there a more general way? Something that guarantees the string contains nothing but readable text, with no entities or escape codes left in it.

Upvotes: 3

Views: 296

Answers (1)

Álvaro González

Reputation: 146460

If I understood your case correctly, you basically want to convert from HTML to plain text.

Depending on the complexity of your input and the robustness and accuracy needed, you have a couple of options:

  • Use strip_tags() to remove HTML tags, mb_convert_encoding() with HTML-ENTITIES as the source encoding to decode entities, and either strtr() or preg_replace() to make any additional replacements:

    $html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
        Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
        Test: &euro;/&eacute;</p>";
    
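    $plain_text = $html;
    // Remove the tags, keeping the text between them
    $plain_text = strip_tags($plain_text);
    // Decode entities such as &nbsp; and &euro; into UTF-8 characters
    $plain_text = mb_convert_encoding($plain_text, 'UTF-8', 'HTML-ENTITIES');
    // Turn tabs and line breaks into plain spaces
    $plain_text = strtr($plain_text, [
        "\t" => ' ',
        "\r" => ' ',
        "\n" => ' ',
    ]);
    // Collapse each run of whitespace into a single space
    $plain_text = preg_replace('/\s+/u', ' ', $plain_text);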
    
    var_dump($html, $plain_text);
    
  • Use a proper DOM parser, plus maybe preg_replace() for further tweaking:

    $html = "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
        Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
        Test: &euro;/&eacute;</p>";
    
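    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    
    // Concatenate every text node; entities are already decoded by the parser
    $plain_text = '';
    foreach ($xpath->query('//text()') as $textNode) {
        $plain_text .= $textNode->nodeValue;
    }
    // Collapse each run of whitespace into a single space
    $plain_text = preg_replace('/\s+/u', ' ', $plain_text);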
    
    var_dump($html, $plain_text);
    

Both solutions should print something like this:

string(169) "<p>Hello &nbsp; Mr &nbsp; John Doe, you are now registered.
    Hello &nbsp; Mr &nbsp; John Doe, your phone number is &nbsp; 555-555-555 &nbsp;
    Test: &euro;/&eacute;</p>"
string(107) "Hello Mr John Doe, you are now registered. Hello Mr John Doe, your phone number is 555-555-555 Test: €/é"
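
If you are on PHP 8.2 or later, note that the HTML-ENTITIES encoding of mbstring is deprecated there. Here is a minimal sketch of the first approach wrapped in a reusable function, with html_entity_decode() swapped in for the decoding step. The function name html_to_plain_text() is made up for illustration, and the final trim() is an addition that is not part of the snippets above:

```php
<?php
// Hypothetical helper (name is illustrative): HTML fragment in, single-line plain text out.
function html_to_plain_text(string $html): string
{
    // Remove the tags, keeping the text between them.
    $text = strip_tags($html);

    // Decode named and numeric entities (&nbsp;, &euro;, &#233;, ...) into UTF-8.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse every whitespace run (tabs, line breaks, decoded non-breaking
    // spaces) into one space. preg_replace() returns null on invalid UTF-8,
    // in which case the text is left as it was.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;

    // Drop the space that may remain at either end.
    return trim($text);
}

$html = "<p>Hello &nbsp; Mr &nbsp; John Doe,\tyou are now registered.\n"
      . "Test: &euro;/&eacute; &nbsp;</p>";

var_dump(html_to_plain_text($html));
```

Like the snippets above, this assumes the input is valid UTF-8 (or plain ASCII with entities); it does not try to restrict the result to letters only.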

Upvotes: 2
