Sumit Nayak

Reputation: 327

How to count the words in a .doc file using a PHP script?

I have tried many things, such as the answers to "How to extract text from word file .doc,docx,.xlsx,.pptx php", but none of them solved the problem.

My server is Linux-based, so enabling extension=php_com_dotnet.dll is not an option.

Another suggestion was to install LibreOffice on the server, convert the .doc file to .txt on the fly, and then count the words in that file. That seems tedious and time-consuming.

I just need a simple PHP script that removes the special characters from the .doc file and counts the number of words.

Upvotes: 2

Views: 4037

Answers (3)

mimsy

Reputation: 39

I've built a tool that combines various methods found around the web and on Stack Overflow to provide word, line and page counts for doc, docx, pdf and txt files. I hope it's of use to people. If anyone can get rtf working with it, I'd love a pull request! https://github.com/joeblurton/doccounter

Upvotes: 2

Sumit Nayak

Reputation: 327

In the end I had to use LibreOffice, but it turned out to be very efficient. It solved all my problems.

So my advice is to install the headless package of LibreOffice on the server and use its command-line conversion.
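For anyone who wants to try this route, here is a minimal sketch of calling the headless converter from PHP. It assumes the LibreOffice binary is on the PATH as soffice and that the txt:Text export filter is available in your install; the function names build_convert_command and doc_to_text are made up for this example, and the temporary directory is just a convenient default.

```php
<?php

// Build the shell command for a headless LibreOffice conversion to plain text.
// Assumption: the binary is reachable as "soffice"; adjust for your install.
function build_convert_command(string $docPath, string $outDir, string $binary = 'soffice'): string
{
    return sprintf(
        '%s --headless --convert-to txt:Text --outdir %s %s',
        escapeshellcmd($binary),
        escapeshellarg($outDir),
        escapeshellarg($docPath)
    );
}

// Convert the document and return its text, or null if the conversion failed.
function doc_to_text(string $docPath): ?string
{
    $outDir = sys_get_temp_dir();
    exec(build_convert_command($docPath, $outDir) . ' 2>&1', $output, $exitCode);

    // LibreOffice names the output after the input file, with a .txt extension.
    $txtPath = $outDir . '/' . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
    if ($exitCode !== 0 || !is_file($txtPath)) {
        return null;
    }

    $text = file_get_contents($txtPath);
    unlink($txtPath); // clean up the temporary .txt file
    return $text === false ? null : $text;
}
```

Calling doc_to_text('sample.doc') and passing a non-null result to str_word_count() then gives the word count. The file arguments go through escapeshellarg() because uploaded file names often contain spaces or quotes.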

Upvotes: 2

clami219

Reputation: 3038

You can try this PHP class, which claims to convert both .doc and .docx files to plain text.

http://www.phpclasses.org/package/7934-PHP-Convert-MS-Word-Docx-files-to-text.html

According to the example given, this is how you use it:

require("doc2txt.class.php");

$docObj = new Doc2Txt("test.docx");
//$docObj = new Doc2Txt("test.doc");

$txt = $docObj->convertToText();
echo $txt;

As you pointed out, the core function of this library, as in many others, is something like this:

<?php

// Naive text extraction from a binary .doc file: split the raw bytes on
// carriage returns and keep only the chunks that contain no NUL bytes.
function read_doc($filename)
{
    $contents = @file_get_contents($filename);
    if ($contents === false) {
        return "";
    }

    $outtext = "";
    foreach (explode(chr(0x0D), $contents) as $thisline) {
        // Skip empty chunks and chunks containing NUL bytes (binary data).
        if (strlen($thisline) == 0 || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }

    // Drop every character outside a small whitelist.
    return preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/_()]/", "", $outtext);
}

echo read_doc("sample.doc");

?>

I've tested this function with a .doc file and it seems to work quite well. It needs some fixes for the last part of the document (some random text still appears at the end of the output), but with some fine-tuning it works reasonably well.

EDIT: You are right, this function works correctly only with .docx documents (the document I tested was probably made using the same mechanism). With a file saved with the .doc extension, this function doesn't work! The only help I can give you right now is a link to the .doc binary specification (and an even more complete file), where you can see how the binary structure is laid out and extract the information from there. I can't do it now, so I hope that somebody else can help you through this!
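Whichever extraction method ends up producing the text, the counting step the question asks for is small. The sketch below is only one way to do it: count_words is a name invented for this example, and it simply counts whitespace-separated tokens. PHP's built-in str_word_count() is shown next to it for comparison; it counts runs of letters (plus apostrophes and hyphens), so tokens made only of digits are not counted by it.

```php
<?php

// Count whitespace-separated tokens in already-extracted text.
function count_words(string $text): int
{
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? 0 : count($tokens);
}

$sample = "Invoice 2014:\r\n  three   more words";
echo count_words($sample), "\n";     // whitespace-separated tokens
echo str_word_count($sample), "\n";  // PHP's built-in letter-based counter
```

Which of the two is the "right" count depends on whether numbers and similar tokens should count as words in your documents.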

Upvotes: 3
