mart

Reputation: 354

How to parse a Microsoft Word 97-2004 .doc file with PHP

How can I parse a .doc file of type "Microsoft Word 97-2004 document" with PHP?

I can parse "normal" .doc files with:

private function read_doc() {
    // Read the whole file as raw bytes; .doc is a binary format.
    $data = @file_get_contents($this->filename);
    if ($data === false) {
        return "";
    }

    // Split on carriage returns (0x0D), which end paragraphs.
    $lines = explode(chr(0x0D), $data);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks containing NUL bytes (binary data).
        if (strlen($thisline) == 0 || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }

    // Keep only a whitelist of plain characters.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

but that doesn't work with Microsoft Word 97-2004 .doc files. I just want to extract the plain text, nothing else.

--> The solution is PHPWord, as Mark Baker recommends in his comment.

Upvotes: 2

Views: 741

Answers (1)

mart

Reputation: 354

In the end I had to install catdoc 0.94.2 on Linux to solve the problem. PHPWord couldn't convert all of the files correctly to plain .txt format.

So here's a solution for Linux users (for example on Ubuntu or another Debian-like system). Install catdoc from the command line:

sudo apt-get install catdoc

If you are on a Windows server, have a look at this. It also worked for me:

http://blog.brush.co.nz/2009/09/catdoc-windows/

Then in your PHP code you can call it like this (for Linux):

$escapeFile = escapeshellarg($data['tmp_name']);
$command = "catdoc $escapeFile";
$output = array();
exec($command,$output);
$text = implode("\n",$output);
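
The call above ignores the exit status, so a missing catdoc binary or an unreadable file just gives you an empty string. Here is a minimal sketch of a wrapper that checks it. The function name doc_to_text and the $binary parameter are my own assumptions for illustration, they are not part of catdoc or PHP:

```php
<?php
/**
 * Hypothetical helper: run a text extractor (catdoc by default) on a file
 * and return its output, or null if the file is unreadable or the command fails.
 */
function doc_to_text(string $path, string $binary = 'catdoc'): ?string
{
    if (!is_readable($path)) {
        return null;
    }

    // Escape both parts so neither the binary name nor the path can inject shell syntax.
    $command = escapeshellcmd($binary) . ' ' . escapeshellarg($path);

    $output = [];
    $exitCode = 0;
    exec($command, $output, $exitCode);

    // A non-zero exit status means the command could not be run or reported an error.
    if ($exitCode !== 0) {
        return null;
    }

    return implode("\n", $output);
}
```

You would call it as doc_to_text($data['tmp_name']) and treat a null result as "conversion failed" instead of continuing with empty text.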

Then you can do, for example:

$text = strip_tags($text);
$text = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $text);
echo nl2br($text) ;

to see the result on screen.
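
Since catdoc produces plain text and not HTML, the cleanup can also be written as a small function that only normalises line endings and blank lines. This is just a sketch of that idea and it is not byte-for-byte equivalent to the regex above; clean_extracted_text is a hypothetical name:

```php
<?php
/**
 * Hypothetical helper: tidy plain text extracted from a document.
 */
function clean_extracted_text(string $text): string
{
    // Normalise Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Drop trailing spaces and tabs at the end of each line.
    $text = preg_replace('/[ \t]+$/m', '', $text);

    // Collapse runs of blank lines into a single line break.
    $text = preg_replace('/\n{2,}/', "\n", $text);

    // Remove leading and trailing whitespace from the whole text.
    return trim($text);
}
```

When printing the result into a page, escaping it with htmlspecialchars() before nl2br() is probably a safer choice than strip_tags(), because characters like < in the document text are then shown as-is instead of being cut out.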

That's what has worked best for me so far. If someone has a better solution, please tell me.

Upvotes: 1
