NekoLopez

Reputation: 609

How to fix errors counting words in plain text with PHP?

Thanks to Document Transformations on Filestack, I can get a text/plain output from .DOC/.DOCX files. I want to count only the words (no numbers or punctuation symbols) in this output with PHP and display the result in an HTML page. So I have this:

<button type="button" id="load" class="btn btn-md btn-info">LOAD FILES</button>
<br>
<div id="result"></div>

<script src="../vendors/jquery/dist/jquery.min.js"></script>
<script src="https://static.filestackapi.com/v3/filestack.js"></script>
<script>

    function numWordsR(urlk){ 
        $.post("result_filestack.php",{
            molk: urlk // example urlk: https://process.filestackapi.com/output=format:txt/AXXXXAXeeeeW33A
        }).done(function(resp){
            $("#result").html(resp);
        });
    }
</script>

And my file result_filestack.php:

$url = $_POST['molk'];
$content = file_get_contents($url); //get txt/plain output content
$onlywords = preg_replace('/[[:punct:]\d]+/', '', $content); //no numbers nor punctuation symbols

function get_num_of_words($string) {
   $string = preg_replace('/\s+/', ' ', trim($string));
   $words = explode(" ", $string);
   return count($words);
}

$numwords = get_num_of_words($onlywords);
echo "<b>TEXT:</b> ".$onlywords."<br><br>Number of words: ".$numwords;

I obtain this result:

[Image: screenshot of the result]

For example, in this case the result says there are 585 words in the text, but if I copy and paste that text into MS Word, it reports 612 words. I changed the PHP code to print the array of words:

function get_text($string) {
 $string = preg_replace('/\s+/', ' ', trim($string));
 $words = explode(" ", $string);
 return $words;
}

$texto002 = get_text($onlywords);
print_r($texto002); // print_r already prints; echoing its return value would append a stray "1"

I noticed that the words are counted incorrectly: in some places, two or three words are taken as one:

[Image: screenshot of the printed array]

How can I fix it?

I'd like your help.

Upvotes: 2

Views: 106

Answers (1)

Emmanuel

Reputation: 447

This could be because the separators aren't regular spaces but special characters, such as non-breaking spaces. I ran into this a while back; before exploding on the regular space, I replaced the entities with a space:

function get_num_of_words($string) {
   // Turn non-breaking-space entities into regular spaces first
   $string = str_replace(array("&nbsp;", "&#160;"), " ", $string);
   // Then collapse every run of whitespace into a single space
   $string = preg_replace('/\s+/', ' ', trim($string));

   $words = explode(" ", $string);

   return count($words);
}
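
If the text/plain output contains the actual non-breaking space character (U+00A0) rather than the `&nbsp;` entity, the string replacements above would not catch it, and `\s` without the `/u` modifier may not match it either. Below is a minimal sketch of an alternative that counts runs of letters instead of splitting on spaces. It assumes the content is UTF-8; `count_words_unicode` is a hypothetical helper name, not something from the question, and treating apostrophes and hyphens as part of a word is a choice of mine that may not match how MS Word counts.

```php
<?php
// Hypothetical helper (not from the original post): counts runs of letters,
// so digits, punctuation and any kind of whitespace act as separators.
// Assumes $text is valid UTF-8.
function count_words_unicode(string $text): int
{
    // Decode entities such as &nbsp; or &#160; in case the input is HTML-escaped.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // \p{L} = any letter, \p{M} = combining marks; /u makes PCRE read UTF-8.
    // An apostrophe or hyphen between letters keeps the word in one piece.
    $pattern = '/[\p{L}\p{M}]+(?:[\'’-][\p{L}\p{M}]+)*/u';
    $matched = preg_match_all($pattern, $text);

    // preg_match_all returns false on failure (for example, invalid UTF-8).
    return $matched === false ? 0 : $matched;
}
```

In result_filestack.php this could be called directly on `$content`, since numbers and punctuation are skipped by the pattern itself and the separate `preg_replace` step would no longer be needed.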

Upvotes: 2
