Reputation: 111
I am trying to retrieve text from a .doc file using PHP. This is the code that I am using:
function read_doc() {
    foreach (glob("*.doc") as $filename) {
        // Open in binary mode; a .doc file is not plain text
        $file_handle = fopen($filename, "rb");
        $stream_text = @fread($file_handle, filesize($filename));
        fclose($file_handle);

        // Split on carriage returns and keep only chunks without NUL bytes
        $stream_line = explode(chr(0x0D), $stream_text);
        $output_text = "";
        foreach ($stream_line as $single_line) {
            $line_pos = strpos($single_line, chr(0x00));
            if (($line_pos !== FALSE) || (strlen($single_line) == 0)) {
                $output_text .= "";
            } else {
                $output_text .= $single_line . " ";
            }
        }

        // Strip every character outside a small whitelist
        $output_text = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $output_text);
        echo $output_text;
    }
}
I get this result:
HYPERLINK [email protected] [email protected] Y, dXiJ(x(I_TS1EZBmU/xYy5g/GMGeD3Vqq8K)fw9 xrxwrTZaGy8IjbRcXI u3KGnD1NIBs RuKV.ELM2fiVvlu8zH (W uV4(Tn 7_m-UBww_8(/0hFL)7iAs),Qg20ppf DU4p MDBJlC5 2FhsFYn3E6945Z5k8Fmw-dznZxJZp/P,)KQk5qpN8KGbe Sd17 paSR 6Q
Is there a solution that would clean this up so that it returns just a string of text from the .doc file?
Upvotes: 0
Views: 72
Reputation: 675
Parsing an MS Word doc is tough to do with your own code.
This is because the .doc format is binary and MS embeds a lot of non-text data in it, so the result looks like gibberish when you echo out the parsed words/paragraphs.
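To see this for yourself, here is a minimal sketch that looks at the first bytes of each file. It assumes a legacy .doc is an OLE2 compound file (signature D0 CF 11 E0 A1 B1 1A E1) and that a .docx is a ZIP archive (starts with "PK"). The function name detect_word_container is made up for illustration and is not part of any library.

```php
<?php
// Sketch only: classify a file by its leading "magic" bytes.
// detect_word_container() is a hypothetical helper, not a library function.
function detect_word_container(string $path): string
{
    if (!is_readable($path)) {
        return 'unreadable';
    }

    // Read only the first 8 bytes, in binary mode
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return 'unreadable';
    }
    $header = fread($handle, 8);
    fclose($handle);

    if ($header === false) {
        return 'unreadable';
    }

    // OLE2 compound file signature, used by legacy .doc files
    if ($header === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }

    // ZIP local file header, used by .docx files
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }

    return 'other';
}

// Report the container type of every .doc file in the current directory
foreach (glob('*.doc') as $filename) {
    echo $filename, ': ', detect_word_container($filename), PHP_EOL;
}
```

If a file reports ole2, the text is stored inside binary streams, so splitting the raw bytes on carriage returns will mostly pick up structure data rather than the document text.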
I recommend you try a library from Packagist to help you with this: Word-Doc-Parser.
It can be easily installed via Composer if you have it on your system.
Upvotes: 0
Reputation: 8613
Doc files are hard to handle with vanilla PHP.
Using https://github.com/alchemy-fr/PHP-Unoconv I accomplished what you need. It will actually detect different formats and produce nice XML for you. Docs can be found here
There are also a lot of examples on the web if you search for "unoconv" + "php".
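As a rough sketch of the idea, the snippet below calls the unoconv command-line tool directly instead of going through the PHP-Unoconv wrapper, and asks for plain text rather than XML. It assumes unoconv and LibreOffice are installed and that unoconv is on the PATH. The helper names build_unoconv_command and doc_to_text are made up for illustration.

```php
<?php
// Sketch only: convert a .doc file to plain text by shelling out to unoconv.
// Assumes the `unoconv` binary (and LibreOffice) is installed and on the PATH.
// build_unoconv_command() and doc_to_text() are hypothetical helper names.

function build_unoconv_command(string $path): string
{
    // -f txt   : request plain-text output
    // --stdout : print the converted document instead of writing a file
    return 'unoconv -f txt --stdout ' . escapeshellarg($path);
}

function doc_to_text(string $path): ?string
{
    $output = shell_exec(build_unoconv_command($path));

    // shell_exec() returns null or false when the command fails or prints nothing
    return is_string($output) ? trim($output) : null;
}

// Convert every .doc file in the current directory
foreach (glob('*.doc') as $filename) {
    $text = doc_to_text($filename);
    echo $text ?? "Could not convert $filename", PHP_EOL;
}
```

The file name is passed through escapeshellarg() so that names with spaces or quotes do not break the command.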
Upvotes: 1