How do I retrieve text from a .doc file in PHP?

I am trying to extract the text from a .doc file using PHP. This is the code I am using:

    function read_doc() {
        foreach (glob("*.doc") as $filename) {

            $file_handle = fopen($filename, "r"); //open the file
            $stream_text = @fread($file_handle, filesize($filename));
            $stream_line = explode(chr(0x0D),$stream_text);
            $output_text = "";
            foreach($stream_line as $single_line){
                $line_pos = strpos($single_line, chr(0x00));
                if(($line_pos !== FALSE) || (strlen($single_line)==0)){
                    $output_text .= "";
                }else{
                    $output_text .= $single_line." ";
                }
            }
            $output_text = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $output_text);
            echo $output_text;
        }
    }

I get this result:

HYPERLINK [email protected] [email protected] Y, dXiJ(x(I_TS1EZBmU/xYy5g/GMGeD3Vqq8K)fw9 xrxwrTZaGy8IjbRcXI u3KGnD1NIBs RuKV.ELM2fiVvlu8zH (W uV4(Tn 7_m-UBww_8(/0hFL)7iAs),Qg20ppf DU4p MDBJlC5 2FhsFYn3E6945Z5k8Fmw-dznZxJZp/P,)KQk5qpN8KGbe Sd17 paSR 6Q

Is there a way to clean this up so that it returns only the text of the .doc file?

Upvotes: 0

Views: 72

Answers (2)

Adam T

Reputation: 675

Parsing an MS Word .doc file is hard to do by hand.

This is because the legacy .doc format is binary and stores a lot of data besides the visible text, so echoing the parsed words or paragraphs from the raw bytes produces gibberish.
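To make the point concrete, here is a minimal sketch that checks whether a file starts with the 8-byte signature of an OLE2 compound file, the binary container that legacy .doc files use. The helper names `looks_like_legacy_doc` and `file_is_legacy_doc` are made up for illustration. Note that other legacy Office formats (.xls, .ppt) share the same signature, so a match only tells you the file is a binary container, not plain text.

```php
<?php
// The 8-byte signature at the start of every OLE2 compound file,
// the container format used by legacy .doc files.
const OLE_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

// Hypothetical helper: true if $bytes begins with the OLE2 signature.
function looks_like_legacy_doc(string $bytes): bool
{
    return strncmp($bytes, OLE_SIGNATURE, strlen(OLE_SIGNATURE)) === 0;
}

// Hypothetical helper: reads only the first 8 bytes of a file and checks them.
function file_is_legacy_doc(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false; // file could not be opened
    }
    $header = fread($handle, 8);
    fclose($handle);

    return $header !== false && looks_like_legacy_doc($header);
}
```

If `file_is_legacy_doc($filename)` returns true, splitting the raw bytes on `chr(0x0D)` as in the question will mostly surface container internals rather than document text.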

I recommend trying a library from Packagist to handle this, such as Word-Doc-Parser.

It can be installed easily via Composer if you have it on your system.

Upvotes: 0

cb0

Reputation: 8613

.doc files are hard to handle with vanilla PHP.

I accomplished what you need using https://github.com/alchemy-fr/PHP-Unoconv. It actually detects different formats and produces clean XML for you. Docs can be found here

There are also plenty of examples on the web if you search for "unoconv" + "php".
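As a rough sketch of the idea, the snippet below calls the `unoconv` command-line tool directly rather than going through the PHP-Unoconv wrapper, so that no library API has to be guessed at. It assumes `unoconv` (and the LibreOffice installation it depends on) is available on the server's PATH and that `exec()` is not disabled. The function names `build_unoconv_command` and `doc_to_text` are made up for illustration.

```php
<?php
// Builds the shell command. The path is escaped so that spaces or
// quotes in a file name cannot break the command line.
function build_unoconv_command(string $path): string
{
    // -f txt   : convert to plain text
    // --stdout : write the result to standard output instead of a file
    return 'unoconv -f txt --stdout ' . escapeshellarg($path);
}

// Hypothetical helper: returns the document text, or null if the
// conversion failed (non-zero exit code).
function doc_to_text(string $path): ?string
{
    $output = [];
    $exitCode = 1;
    exec(build_unoconv_command($path), $output, $exitCode);

    return $exitCode === 0 ? implode("\n", $output) : null;
}
```

Calling `doc_to_text($filename)` inside the `glob("*.doc")` loop from the question would replace the manual byte filtering; a `null` result means the conversion did not succeed for that file.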

Upvotes: 1
