Reputation: 1879
I have been asked about creating a site where some users can upload Microsoft Word documents and others can then search for uploaded documents that contain certain keywords. The site would sit on a Linux server running PHP and MySQL. I'm currently trying to find out if and how I can scrape the text from these documents. If anyone can suggest a good way of doing this, it would be much appreciated.
Upvotes: 2
Views: 1884
Reputation: 8920
Scraping text from the new docx format is trivial. The file itself is just a zip file, and if you look inside one, you will find a bunch of xml files. The text is contained in word/document.xml within this zip file, and all the actual user-entered text will appear in <w:t> tags. If you extract all text that appears in <w:t> tags, you will have scraped the document.
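As a rough illustration of the approach above, here is a minimal PHP sketch. The function name docx_extract_text is hypothetical, and it assumes PHP's zip and dom extensions are enabled and that the file uses the usual (transitional) WordprocessingML namespace. It joins the <w:t> runs inside each <w:p> paragraph so that words split across runs stay intact, which is a design choice for keyword search and not something the format requires.

```php
<?php
// Hypothetical helper: returns the plain text of a .docx file,
// one line per paragraph, or false if the file cannot be read.
function docx_extract_text(string $path)
{
    // A .docx file is a zip archive; the body lives in word/document.xml.
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    $dom = new DOMDocument();
    if (!@$dom->loadXML($xml)) {
        return false;
    }

    // Assumption: the standard (transitional) WordprocessingML namespace.
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $paragraphs = [];
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $p) {
        // Concatenate every <w:t> in this paragraph without separators,
        // because Word often splits a single word across several runs.
        $text = '';
        foreach ($p->getElementsByTagNameNS($ns, 't') as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }

    return trim(implode("\n", $paragraphs));
}
```

The returned string could then be stored alongside the upload for searching. Note that this sketch only reads the main document body; headers, footers and footnotes live in other XML files inside the archive, and paragraphs nested inside text boxes may be picked up twice.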
Upvotes: 4
Reputation: 15780
Here's a good example using catdoc, a command-line tool that extracts text from binary .doc files:
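function catdoc_string($str)
{
    // Requires the catdoc command-line tool to be installed.
    // Write the raw document contents to a temp file for catdoc to read.
    $tmpfname = tempnam('/tmp', 'doc');
    $handle = fopen($tmpfname, 'w');
    fwrite($handle, $str);
    fclose($handle);
    // Run catdoc and capture its output (stderr included).
    $ret = shell_exec('catdoc -ab ' . escapeshellarg($tmpfname) . ' 2>&1');
    // Remove the temp file.
    unlink($tmpfname);
    // If the shell reports that catdoc could not be run, signal failure.
    if (preg_match('/^sh: line 1: catdoc/i', $ret)) {
        return false;
    }
    return trim($ret);
}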
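function catdoc_file($fname)
{
    // Requires the catdoc command-line tool to be installed.
    // Run catdoc directly on an existing file and capture its output.
    $ret = shell_exec('catdoc -ab ' . escapeshellarg($fname) . ' 2>&1');
    // If the shell reports that catdoc could not be run, signal failure.
    if (preg_match('/^sh: line 1: catdoc/i', $ret)) {
        return false;
    }
    return trim($ret);
}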
Upvotes: 2