weird characters in PHP

Question

When I read text from Word files, some weird characters are printed in the output. Is there any way to remove them?

I use this function to read from docx files:

function readDocx() {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($this->doc_path)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            // (loadXML() is an instance method; calling it statically fails on PHP 8)
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags

            $contents = explode('\n',strip_tags($xml->saveXML()));
            $text = '';
            foreach($contents as $i=>$content) {
                $text .= $contents[$i];
            }
            return $text;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

hakre · Accepted Answer

This is the part I love most:

        $contents = explode('\n',strip_tags($xml->saveXML()));
        $text = '';
        foreach($contents as $i=>$content) {
            $text .= $contents[$i];
        }
        return $text;

No idea where you copied it from, but it's basically:

        $text = strip_tags($xml->saveXML());
        return $text;

Beyond that, saveXML() returns a UTF-8 encoded string. Your browser is evidently decoding the page as something else, so convert the text to the encoding the browser expects (you should know which one that is).
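
To illustrate where the weird characters come from, here is a minimal sketch. It assumes the mbstring extension is available and that the browser is decoding the page as ISO-8859-1; both are assumptions, and the sample string is made up.

```php
<?php
// Hypothetical sample: UTF-8 text such as saveXML() would return.
$utf8 = "Caf\u{00E9}";

// What the reader sees when the UTF-8 bytes are decoded as ISO-8859-1:
// each byte of the two-byte "é" is shown as its own character.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Converting to the encoding the browser expects (assumed ISO-8859-1 here).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo $misread, "\n";
echo strlen($utf8), ' bytes in UTF-8, ', strlen($latin1), " bytes in ISO-8859-1\n";
```

If the expected encoding really is ISO-8859-1, sending $latin1 instead of $utf8 would make the characters display correctly.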

Since I don't know which encoding that is, and you probably don't either, just convert everything to HTML entities to make this dead-safe:

        $text = strip_tags($xml->saveXML());
        return htmlentities($text, ENT_QUOTES, 'UTF-8');
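
As a rough illustration of what that call does, the sketch below runs htmlentities() on a made-up sample string. Note that htmlentities() only replaces characters that have a named HTML entity, so this is a sketch of the common case rather than a guarantee for every character.

```php
<?php
// Hypothetical sample: text with an accented letter and typographic quotes,
// as it might come out of strip_tags($xml->saveXML()).
$text = "Caf\u{00E9} \u{201C}quoted\u{201D}";

// Characters with a named entity are replaced by plain ASCII sequences,
// so the result no longer depends on how the browser decodes the page.
$safe = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```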

The real fix, though, is to understand what you are sending to the browser and then tell the browser what it is.
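
One way to tell the browser what it is getting is to declare the charset in the Content-Type header. The sketch below assumes the page is served as HTML and the text is UTF-8; the helper names contentTypeHeader() and renderText() are hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: builds the header line that declares the encoding.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Hypothetical helper: escapes only the HTML-special characters;
// everything else is left as UTF-8 for the browser to decode.
function renderText(string $utf8Text): string
{
    return htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');
}

// The header has to be sent before any output.
if (!headers_sent()) {
    header(contentTypeHeader());
}
echo renderText("Caf\u{00E9} <b>"), "\n";
```

With the charset declared, the UTF-8 text from saveXML() can be sent as it is, and only the markup characters need escaping.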
