JuanFernandoz

Reputation: 799

Finding repeated words in PHP without specifying the word itself

I've been thinking about something for a project I want to do. I'm not an advanced user and I'm still learning, so I don't know whether this is possible:

Suppose we have 100 html documents containing many tables and text inside them.

Question one: is it possible to analyze all this text, find the repeated words, and count them?

Yes, it's possible with some functions, but here's the problem: what if we don't know in advance which words we are going to find? That is, we would have to tell the code what counts as a word.

Suppose, for example, that a word is a run of seven characters. The idea would be to find every such pattern and report how often each one occurs. What would be the best way to do this?

Thank you very much in advance.

Example:

Search: five-character patterns in the following phrases:

Text one:

"It takes an ocean not to break"

Text two:

"An ocean is a body of saline water"

Result

takes 1
break 1
water 1
ocean 2

Thanks in advance for your help.

Upvotes: 4

Views: 8326

Answers (2)

Matt

Reputation: 115

An alternative method that uses built-in functions and also ignores short words:

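function get_word_counts($text)
{
    // Split the text into an array of words.
    $words = str_word_count($text, 1);
    // Ignore short words (fewer than 4 characters).
    $words = array_filter($words, function ($w) { return strlen($w) >= 4; });
    // Map each remaining word to its number of occurrences.
    return array_count_values($words);
}

// Sample input built from the phrases in the question.
$text = "It takes an ocean not to break. An ocean is a body of saline water.";
$counts = get_word_counts($text);
arsort($counts);
print_r($counts);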

Note: this assumes a single block of text. If you are processing an array of phrases, loop over them with foreach ($phrases as $phrase) and combine the results.
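
For the array-of-phrases case, and for the fixed-length filter described in the question, a minimal sketch might look like the following. The function name count_words_of_length and its $length parameter are made up for illustration, and lowercasing the input is an assumption so that "Ocean" and "ocean" are counted together. It needs PHP 7.0 or later because of the ?? operator.

```php
<?php
// Hypothetical helper: count words of exactly $length characters across several phrases.
function count_words_of_length(array $phrases, int $length): array
{
    $counts = [];
    foreach ($phrases as $phrase) {
        // Lowercase first (an assumption) so "Ocean" and "ocean" share one counter.
        foreach (str_word_count(strtolower($phrase), 1) as $word) {
            if (strlen($word) === $length) {
                // Start at 0 the first time a word is seen, then increment.
                $counts[$word] = ($counts[$word] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // highest counts first, keys preserved
    return $counts;
}

$phrases = [
    "It takes an ocean not to break",
    "An ocean is a body of saline water",
];
print_r(count_words_of_length($phrases, 5));
```

With the two phrases from the question this should report ocean twice and takes, break and water once each; "saline" is skipped because it has six characters.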

Upvotes: 1

sberry

Reputation: 131978

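function get_word_counts($phrases) {
    $counts = array();
    foreach ($phrases as $phrase) {
        $words = explode(' ', $phrase);
        foreach ($words as $word) {
            // Strip everything except letters and hyphens (basic punctuation handling).
            $word = preg_replace("#[^a-zA-Z\-]#", "", $word);
            if ($word === '') {
                continue; // skip tokens that were only punctuation or extra spaces
            }
            // Initialize the counter on first sight to avoid an undefined-index notice.
            if (!isset($counts[$word])) {
                $counts[$word] = 0;
            }
            $counts[$word] += 1;
        }
    }
    return $counts;
}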

$phrases = array("It takes an ocean of water not to break!", "An ocean is a body of saline water, or so I am told.");

$counts = get_word_counts($phrases);
arsort($counts);
print_r($counts);

OUTPUT

Array
(
    [of] => 2
    [ocean] => 2
    [water] => 2
    [or] => 1
    [saline] => 1
    [body] => 1
    [so] => 1
    [I] => 1
    [told] => 1
    [a] => 1
    [am] => 1
    [An] => 1
    [an] => 1
    [takes] => 1
    [not] => 1
    [to] => 1
    [It] => 1
    [break] => 1
    [is] => 1
)

EDIT
Updated to deal with basic punctuation, based on @Jack's comment.
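
Since the question mentions HTML documents with tables, the markup would need to be removed before counting. A rough sketch, assuming strip_tags is good enough for the documents at hand (a real parser such as DOMDocument would be more robust): count_words_in_html and its $minLength parameter are hypothetical names, the lowercasing is an assumption, and the arrow function needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper: pull the text out of an HTML string and count its words.
function count_words_in_html(string $html, int $minLength = 4): array
{
    // Put a space before every tag so adjacent cells such as
    // <td>a</td><td>b</td> do not merge into "ab" once the tags are removed.
    $spaced = str_replace('<', ' <', $html);
    // Remove the tags, then turn entities like &amp; back into characters.
    $text = html_entity_decode(strip_tags($spaced));
    // Lowercase (an assumption) and split into words.
    $words = str_word_count(strtolower($text), 1);
    // Keep only words with at least $minLength characters.
    $words = array_filter($words, fn($w) => strlen($w) >= $minLength);
    $counts = array_count_values($words);
    arsort($counts); // highest counts first
    return $counts;
}

$html = '<table><tr><td>Ocean</td><td>water</td></tr></table>'
      . '<p>An ocean is a body of saline water.</p>';
print_r(count_words_in_html($html));
```

To cover many files, the same function could be called on the result of file_get_contents for each document and the per-file counts added together.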

Upvotes: 8
