Reputation: 354
How can I parse a .doc file ("Microsoft Word 97-2004 document") with PHP?
I can parse "normal" .doc files with:
private function read_doc() {
    // Read the whole file in binary mode, then release the handle.
    $fileHandle = fopen($this->filename, "rb");
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);
    // Split the raw bytes on carriage returns (0x0D).
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty chunks that contain no NUL byte.
        $pos = strpos($thisline, chr(0x00));
        if ($pos === FALSE && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Drop every character outside a small ASCII whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}
but that doesn't work with Microsoft Word 97-2004 .doc files. I just want to extract the plain text, nothing else.
Update: the solution is PHPWord, as Mark Baker recommends in his comment.
Upvotes: 2
Views: 741
Reputation: 354
In the end I had to install catdoc 0.94.2 on Linux to solve the problem. PHPWord couldn't convert all of the files correctly to plain .txt format.
So here's a solution for Linux users (for example Ubuntu or other Debian-like systems). Install catdoc from the command line:
sudo apt-get install catdoc
If you are on a Windows server, have a look at this. It also worked for me:
http://blog.brush.co.nz/2009/09/catdoc-windows/
Then you can call it from your PHP code like this (on Linux):
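// Quote the file path so it is safe to pass to the shell.
$escapeFile = escapeshellarg($data['tmp_name']);
$command = "catdoc $escapeFile";
// exec() appends each line of catdoc's output to $output.
$output = array();
exec($command, $output);
$text = implode("\n", $output);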
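The call above ignores failures such as a missing catdoc binary or an unreadable file. A minimal sketch of the same call with basic error checks could look like this. The function name extractTextWith and its parameters are hypothetical, not part of catdoc or PHP; the sketch only assumes that the extractor writes plain text to standard output and returns exit code 0 on success:

```php
<?php
/**
 * Run an external text extractor (for example catdoc) on a file and
 * return whatever it printed. Hypothetical helper for illustration.
 */
function extractTextWith(string $binary, string $path): string
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read file: $path");
    }

    // Escape both parts so the file name cannot inject shell syntax.
    $command = escapeshellcmd($binary) . ' ' . escapeshellarg($path);

    $output = [];
    $exitCode = 0;
    exec($command, $output, $exitCode);

    // A non-zero exit code usually means the tool is missing or failed.
    if ($exitCode !== 0) {
        throw new RuntimeException("$binary exited with code $exitCode");
    }

    return implode("\n", $output);
}
```

It would be called as $text = extractTextWith('catdoc', $data['tmp_name']); inside a try/catch block, so that a broken upload produces an error message instead of an empty string.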
Then you can, for example, clean up the output with
$text = strip_tags($text);
$text = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $text);
echo nl2br($text) ;
to see the result on screen.
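To make the effect of that cleanup easier to see, here is a small sketch that wraps the same two calls in a function and runs it on a sample string. The function name cleanExtractedText and the sample text are made up for illustration; the strip_tags call and the regular expression are the ones shown above:

```php
<?php
// Hypothetical wrapper around the two cleanup steps from the answer.
function cleanExtractedText(string $text): string
{
    // Remove anything that looks like an HTML or PHP tag.
    $text = strip_tags($text);
    // Collapse runs of blank or whitespace-only lines into one newline.
    $cleaned = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $text);
    // preg_replace returns null on error; fall back to the stripped text.
    return $cleaned ?? $text;
}

$sample = "First line\n\n\n  \nSecond line";
echo cleanExtractedText($sample);
// Prints "First line" and "Second line" on two consecutive lines.
```

Single line breaks are left alone, so paragraphs that catdoc separates with one newline keep their structure.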
That's what has worked best for me so far. If someone has a better solution, please tell me.
Upvotes: 1