Ultimate Gobblement
Ultimate Gobblement

Reputation: 1879

How can I scrape MS Word document text on a Linux server?

I have been asked about creating a site where some users can upload Microsoft Word documents, then others can then search for uploaded documents that contain certain keywords. The site would be sitting on a Linux server running PHP and MySQL. I'm currently trying to find out if and how I can scrape this text from the documents. If anyone can suggest a good way of going about doing this it would be much appreciated.

Upvotes: 2

Views: 1884

Answers (2)

ZoFreX
ZoFreX

Reputation: 8920

Scraping text from the new docx format is trivial. The file itself is just a zip file, and if you look inside one, you will find a bunch of xml files. The text is contained in word/document.xml within this zip file, and all the actual user-entered text will appear in <w:t> tags. If you extract all text that appears in <w:t> tags, you will have scraped the document.

Upvotes: 4

Ruel
Ruel

Reputation: 15780

Here's a good example using catdoc:

function catdoc_string($str)
{
    // requires catdoc

    // write to temp file
    $tmpfname = tempnam ('/tmp','doc');
    $handle = fopen($tmpfname,'w');
    fwrite($handle,$a);
    fclose($handle);

    // run catdoc
    $ret = shell_exec('catdoc -ab '.escapeshellarg($tmpfname) .' 2>&1');

    // remove temp file
    unlink($tmpfname);

    if (preg_match('/^sh: line 1: catdoc/i',$ret)) {
        return false;
    }

    return trim($ret);
}

function catdoc_file($fname)
{
    // requires catdoc

    // run catdoc
    $ret = shell_exec('catdoc -ab '.escapeshellarg($fname) .' 2>&1');

    if (preg_match('/^sh: line 1: catdoc/i',$ret)) {
        return false;
    }

    return trim($ret);
}

Source

Upvotes: 2

Related Questions