felipep

Reputation: 2512

PHP word count that approximates the result of Word's word counter

I'm programming a small web app to manage texts with external writers. Overall it works well, but I have a small problem with the word counter.

The writers are paid based on the number of words in a text, and the text contains HTML tags. The problem is that German characters (Ä, Ö, Ü, ß) are used.

First, I strip the tags:

    $content = strip_tags($content);

Then I replace new lines and tabs with single spaces:

    $replace   = array("\r\n", "\n", "\r", "\t");
    $content = str_replace($replace, ' ', $content);

Finally, I try to get the number of words:

Method 1:

    $characterMap = 'ÄÖÜäöü߀';
    $count = str_word_count($content, 0, $characterMap);

Method 2:

    $to_delete = array('.', ',', ';', "'", '@');
    $content = str_replace($to_delete, '', $content);

    $count = count(preg_split('~[^\p{L}\p{N}\']+~u',$content));

But the results differ from those of other tools, such as Word or the CKEditor word_count plugin.

For example, for a sample text:

Word and the CKEditor word count give 987 words

Method 1: 968 Words

Method 2: 995 Words

The problem with the second method is just the - separators between the words, but my question is: is there a better method to count the number of words in a text in PHP?

Upvotes: 3

Views: 1127

Answers (3)

user557597

This might give a better approximation for method 2:

    $string = "He€.llo, ho-w€d9   €   are you? fi€ne ÄÖÜäöü߀, and 'ÄÖÜäöü߀ you?";
    $words = preg_split('/[^\p{L}\p{N}]*\p{Z}[^\p{L}\p{N}]*/u', $string);
    print("count = " . count($words) . "\n\n");
    print_r($words);
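
One detail that can inflate any `preg_split`-based count: without the `PREG_SPLIT_NO_EMPTY` flag, separators at the very start or end of the text produce empty strings, and `count()` includes them. The sketch below reuses the pattern from above and only adds that flag. The helper name `split_words` and the sample sentence are made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of the original answer): split text into
// words with the same separator pattern, but discard empty pieces.
function split_words(string $text): array
{
    // Separator: optional non-alphanumerics, one Unicode space separator,
    // optional non-alphanumerics. Line breaks and tabs are assumed to have
    // been replaced by spaces already, as in the question.
    $pattern = '/[^\p{L}\p{N}]*\p{Z}[^\p{L}\p{N}]*/u';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure (for example, invalid UTF-8).
    return $parts === false ? [] : $parts;
}

$words = split_words("  Hallo, wie geht's? Sehr gut - danke!  ");
echo count($words), "\n"; // 6: Hallo, wie, geht's, Sehr, gut, danke
```

Note that a free-standing " - " is swallowed as a separator here; a counter that treats every whitespace-delimited token as a word would count it, so small differences can remain.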

Upvotes: 0

brandonscript

Reputation: 72875

First, you could combine your two replace statements into one; the word count will ignore double spaces. Second, I'm not sure what your regex is meant to achieve, but it looks rather strange.

You should be able to simply do this:

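    $content = strip_tags($content);
    // Turn line breaks, tabs, and the punctuation from Method 2 into spaces in one pass.
    $replace = array("\r\n", "\n", "\r", "\t", '.', ',', ';', "'", '@');
    $content = str_replace($replace, ' ', $content);
    $characterMap = 'ÄÖÜäöü߀'; // same character list as Method 1 in the question
    $count = str_word_count($content, 0, $characterMap);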
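
As an alternative to splitting on separators, one could count the words directly by matching runs of Unicode letters and digits. The following is only a sketch under several assumptions, not a drop-in replacement for Word's counter: the function name `count_words_utf8` is hypothetical, the input is assumed to be UTF-8, hyphens and apostrophes inside a word are assumed to join it, and a space is inserted before every `<` so that adjacent block elements such as `</p><p>` do not glue two words together (which also means an inline tag in the middle of a word would split it).

```php
<?php
// Hypothetical helper (an assumption, not the original poster's method):
// count words in an HTML fragment encoded as UTF-8.
function count_words_utf8(string $html): int
{
    // Put a space before each tag so "</p><p>" does not merge neighbours.
    $text = strip_tags(str_replace('<', ' <', $html));

    // Decode entities such as &auml; or &nbsp; after the tags are gone.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // A word: letters/digits, optionally joined by a hyphen or apostrophe.
    $pattern = '/[\p{L}\p{N}]+(?:[-\'’][\p{L}\p{N}]+)*/u';

    // preg_match_all returns the number of matches, or false on failure.
    $n = preg_match_all($pattern, $text);
    return $n === false ? 0 : $n;
}

$html = '<p>Größe und Straße</p><p>E-Mail-Adresse, 2 Äpfel!</p>';
echo count_words_utf8($html), "\n"; // 6
```

Whether "E-Mail-Adresse" should count as one word or three, and whether a lone number counts, are exactly the kinds of rules on which counters disagree, so the pattern would need adjusting to whichever reference tool the payment is based on.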

Upvotes: 1

Dibesjr

Reputation: 496

You could try taking a look at str_word_count and see if that matches up better than your current solutions.

http://php.net/manual/en/function.str-word-count.php

Example usage:

    $tag  = 'My Name is Gaurav';
    $word = str_word_count($tag); // same variable name as above
    echo $word; // prints 4
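
To see why a count differs from another tool, it can help to list the words `str_word_count` actually found instead of only the total. The sketch below uses format `1` (return the words as an array) on an ASCII sample sentence; by default digits are not treated as word characters, and the third argument adds extra characters to the allowed set. The variable names are made up for illustration.

```php
<?php
$text = "Hello fri3nd, you're looking good today!";

// Format 1 returns the words themselves rather than their number.
$words = str_word_count($text, 1);
print_r($words); // "fri3nd" is split at the digit into "fri" and "nd"

// Adding the digits to the character list keeps "fri3nd" in one piece.
$withDigits = str_word_count($text, 1, '0123456789');
print_r($withDigits);

echo count($words), ' vs ', count($withDigits), "\n"; // 7 vs 6
```

Comparing such a list against the text makes it easier to spot which tokens (numbers, hyphenated words, special characters) are counted differently.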

Upvotes: 0
