Reputation: 742
When I read text from Word (.docx) files, some weird characters are printed in the output. Is there any way to remove them?
I use this function to read from docx files:
function readDocx() {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($this->doc_path)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string (loadXML() is an instance method, so create the document first)
            // Skip errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            $contents = explode('\n',strip_tags($xml->saveXML()));
            $text = '';
            foreach($contents as $i=>$content) {
                $text .= $contents[$i];
            }
            return $text;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}
Upvotes: 1
Views: 318
Reputation: 197624
This is the part I love most:
$contents = explode('\n',strip_tags($xml->saveXML()));
$text = '';
foreach($contents as $i=>$content) {
$text .= $contents[$i];
}
return $text;
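No idea where you copied it from, but since '\n' in single quotes is a literal backslash followed by n rather than a newline, the explode() has next to nothing to split on and the loop just glues the pieces back together. It's basically: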
$text = strip_tags($xml->saveXML());
return $text;
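To see why the explode() and the loop cancel out, here is a small sketch (the sample string is made up for illustration) comparing single-quoted and double-quoted \n in PHP:

```php
<?php
// Single quotes: '\n' is two characters, a backslash and the letter n.
$singleLen = strlen('\n');
// Double quotes: "\n" is one character, a line feed.
$doubleLen = strlen("\n");

// Illustrative sample text containing a real line feed.
$text = "first line\nsecond line";

// Splitting on the literal backslash-n finds no match, so we get one piece back.
$parts = explode('\n', $text);

// Concatenating the pieces reproduces the input unchanged.
$rejoined = implode('', $parts);

var_dump($singleLen, $doubleLen, count($parts), $rejoined === $text);
```

With real newlines as the delimiter (double quotes) the pieces would be joined without their line breaks, which is probably not what was intended either.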
Beyond that, saveXML() returns a string in UTF-8 encoding. Your browser expects something else, so just convert the string to whatever encoding that is (you should know it).
Since I don't know which encoding that is, and you probably don't either, just convert everything to HTML entities to make this dead-safe:
$text = strip_tags($xml->saveXML());
return htmlentities($text, ENT_QUOTES, 'UTF-8');
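As a rough illustration of what that call does, the sketch below feeds htmlentities() a made-up string containing the UTF-8 bytes of curly quotes, the kind of character Word tends to insert. The sample text is an assumption, not taken from the question.

```php
<?php
// UTF-8 bytes for a left (U+201C) and right (U+201D) curly double quote.
$text = "\xE2\x80\x9CHello\xE2\x80\x9D & goodbye";

// Characters that have a named HTML entity are replaced by pure ASCII.
$encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n"; // &ldquo;Hello&rdquo; &amp; goodbye
```

Note that htmlentities() only replaces characters that have a named entity; anything else is passed through as UTF-8 bytes, so the page encoding still matters for those.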
The real fix, though, is to understand which encoding you are sending to the browser and then tell the browser what it is.
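One way to tell the browser, sketched under the assumption that the extracted text is UTF-8 and is served as an HTML page. The function name renderUtf8Page is hypothetical and only for illustration:

```php
<?php
// Hypothetical helper: wraps text in a minimal HTML page that declares UTF-8.
function renderUtf8Page(string $text): string
{
    // Escape markup characters only; non-ASCII characters stay as UTF-8 bytes.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head><meta charset=\"UTF-8\"></head>\n"
         . "<body><pre>{$safe}</pre></body>\n</html>\n";
}

// Declare the encoding in the HTTP header as well, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderUtf8Page("\xE2\x80\x9CHello\xE2\x80\x9D");
```

With the charset declared, the curly quotes can be sent as they are and no entity conversion is needed.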
Upvotes: 1
Reputation: 14479
This has nothing to do with PHP. It's a server encoding issue: look at Apache's default charset setting (the AddDefaultCharset directive).
Upvotes: 0