modarwish

Reputation: 495

Get the count of unique words from all .txt files in a directory

I have a directory of text files. I want to loop through every text file in the directory and get the overall count of unique words (the vocabulary size) across ALL the files together, not the count of unique words for each individual file.

For example, I have three text files in a directory. Here are their contents:

file1.txt -> here is some text.

file2.txt -> here is more text.

file3.txt -> even more text.

So the count of unique words for this directory of text files in this case is 6.

I have tried to use this code:

$files = glob("C:\\wamp\\dir");

$out = fopen("mergedFiles.txt", "w");


  foreach($files as $file){
      $in = fopen($file, "r");
      while ($line = fread($in)){
           fwrite($out, $line);
      }
      fclose($in);
  }


  fclose($out);

to merge all the text files. After merging, I planned to use array_unique() on the contents of mergedFiles.txt. However, the code is not working.

How can I get the unique word count of all the text files in the directory in the best way possible?

Upvotes: 0

Views: 1615

Answers (2)

mickmackusa

Reputation: 47904

Unless you have legitimate reasons not to simply concatenate the files and process their content as a concatenated string, use this snippet to target txt files in a directory, join their texts, make the text lowercase, isolate words, remove duplicates, then count unique words:

Code (not fully tested on a filesystem): (Demo)

echo count(
    array_unique(
        str_word_count(
            strtolower(
                implode(
                    ' ',
                    array_map(
                        'file_get_contents',
                        glob("*.txt")
                    )
                )
            ),
            1
        )
    )
);

Assuming these texts are read from the files:

[
    'here is some text.',
    'here is more text.',
    'even more text.'
]

The output is 6 from a unique array of:

array (
  0 => 'here',
  1 => 'is',
  2 => 'some',
  3 => 'text',
  6 => 'more',
  8 => 'even',
)

Modify the snippet as needed: perhaps use a different technique/algorithm to identify "words", or use mb_strtolower(), or don't use strtolower() at all.
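As one possible modification, here is a minimal sketch that swaps str_word_count() for a Unicode-aware regular expression and uses mb_strtolower(). The helper name countUniqueWords() is hypothetical, and the sketch assumes UTF-8 input and that the mbstring extension is available. Unlike str_word_count(), it treats apostrophes and hyphens as separators, so "don't" would count as two words.

```php
<?php
// Hypothetical helper: counts distinct words across several texts.
// Assumes UTF-8 input and the mbstring extension.
function countUniqueWords(array $texts): int
{
    // Join the texts and lowercase them so "Here" and "here" count once.
    $text = mb_strtolower(implode(' ', $texts), 'UTF-8');

    // \p{L}+ matches runs of letters in any script; the u flag enables UTF-8 mode.
    preg_match_all('/\p{L}+/u', $text, $matches);

    return count(array_unique($matches[0]));
}

$texts = ['here is some text.', 'here is more text.', 'even more text.'];
echo countUniqueWords($texts), PHP_EOL; // prints 6
```

To run it against real files, the texts could be produced with array_map('file_get_contents', glob("*.txt")) as in the snippet above.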

Upvotes: 0

fdehanne

Reputation: 1718

You can try this:

$allWords = array();

foreach (glob("*.txt") as $filename) // loop on each file
{
    $contents = file_get_contents($filename); // Get file contents
    $words = explode(' ', $contents); // Make an array with words

    if ( $words )
        $allWords = array_merge($allWords, $words); // combine global words array and file words array
}

var_dump(count(array_unique($allWords)));
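Note that explode(' ', ...) splits on single spaces only, so words separated by newlines or tabs stay glued together. The sketch below is one way around that, not the answer's own method: it splits on any run of whitespace with preg_split() and records each word as an array key, which removes duplicates as it goes instead of merging everything first. The helper name countUniqueTokens() and its $pattern parameter are hypothetical. Punctuation is still left attached to the words here.

```php
<?php
// Hypothetical helper: $pattern defaults to the "*.txt" glob used above.
function countUniqueTokens(string $pattern = '*.txt'): int
{
    $seen = []; // words stored as array keys, so each word appears once

    // glob() can return false on error; fall back to an empty list.
    foreach (glob($pattern) ?: [] as $filename) {
        $contents = file_get_contents($filename);
        if ($contents === false) {
            continue; // skip files that cannot be read
        }

        // Split on any run of whitespace and drop empty strings.
        foreach (preg_split('/\s+/', $contents, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $seen[$word] = true;
        }
    }

    return count($seen);
}
```

Calling countUniqueTokens("C:\\wamp\\dir\\*.txt") would target the directory from the question, assuming that path contains the .txt files.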

EDIT: Another version, which:

  • removes dots
  • collapses multiple spaces
  • separates words when the space between the end of one sentence and the start of the next is missing.

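// Strip trailing dots from a word
function removeDot($string) {
    return rtrim($string, '.');
}

// Inside the loop, in place of the explode() line:
// collapse whitespace runs, then add a space after any dot directly followed by a letter
$words = explode(' ', preg_replace('#\.([a-zA-Z])#', '. $1', preg_replace('/\s+/', ' ',$contents)));
$words = array_map("removeDot", $words);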
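To show how the fragment above fits together, here is a minimal sketch that wraps the same two preg_replace() calls and the dot trimming in one function. The name extractWords() is hypothetical, and the trim() and array_filter() steps are additions to avoid counting empty strings. In-memory strings stand in for the file contents so the example runs without a filesystem.

```php
<?php
// Hypothetical helper wrapping the cleanup steps from the fragment above.
function extractWords(string $contents): array
{
    // Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    $normalized = preg_replace('/\s+/', ' ', trim($contents));

    // Insert a space after a dot that is directly followed by a letter.
    $normalized = preg_replace('#\.([a-zA-Z])#', '. $1', $normalized);

    // Split into words and strip trailing dots.
    $words = array_map(
        function ($word) {
            return rtrim($word, '.');
        },
        explode(' ', $normalized)
    );

    // Drop empty strings left by blank input or tokens made only of dots.
    return array_values(array_filter($words, 'strlen'));
}

$allWords = [];
foreach (['here is some text.', 'here  is more text.even more text.'] as $contents) {
    $allWords = array_merge($allWords, extractWords($contents));
}

echo count(array_unique($allWords)), PHP_EOL; // prints 6
```

In the original loop, file_get_contents($filename) would supply $contents in place of the sample strings.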

Upvotes: 2
