Get the count of unique words from all .txt files in a directory

Question

I have a directory of text files. I want to loop through every file in the directory and get the number of unique words (the vocabulary size) across ALL the files combined, NOT the number of unique words in each individual file.

For example, I have three text files in a directory. Here are their contents:

file1.txt -> here is some text.

file2.txt -> here is more text.

file3.txt -> even more text.

So in this case the count of unique words for the directory is 6.

I have tried to use this code:

$files = glob("C:\wamp\dir");

$out = fopen("mergedFiles.txt", "w");


  foreach($files as $file){
      $in = fopen($file, "r");
      while ($line = fread($in)){
           fwrite($out, $line);
      }
      fclose($in);
  }


  fclose($out);

The idea was to merge all the text files and then use array_unique() on the words in mergedFiles.txt. However, the code is not working.

How can I get the unique word count of all the text files in the directory in the best way possible?

fdehanne · Accepted Answer

You can try this:

$allWords = array();

foreach (glob("*.txt") as $filename) // loop over each .txt file in the current directory
{
    $contents = file_get_contents($filename); // get the file contents
    $words = explode(' ', $contents); // split on single spaces into an array of words

    if ( $words )
        $allWords = array_merge($allWords, $words); // append this file's words to the global array
}

var_dump(count(array_unique($allWords))); // number of distinct words across all files
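As for why the merge code in the question fails: glob() needs a wildcard pattern to match files (a bare directory path matches only the directory itself), and fread() requires a length argument. If you still want the merged file, here is a minimal sketch with those two points fixed. The function name mergeTextFiles, its parameters, and the 8192-byte chunk size are my own choices for illustration, not something from the question.

```php
<?php
// Hypothetical helper: concatenates every .txt file in $dir into $outPath
// and returns how many files were merged.
function mergeTextFiles(string $dir, string $outPath): int
{
    $out = fopen($outPath, 'w');
    if ($out === false) {
        return 0; // could not create the output file
    }
    $outReal = realpath($outPath);

    // glob() needs a wildcard pattern to list the files inside a directory.
    $files = glob(rtrim($dir, '/\\') . '/*.txt');
    if ($files === false) {
        $files = [];
    }

    $merged = 0;
    foreach ($files as $file) {
        // Skip the output file itself if it lives in the same directory.
        if (realpath($file) === $outReal) {
            continue;
        }
        $in = fopen($file, 'r');
        if ($in === false) {
            continue; // unreadable file, skip it
        }
        // fread() requires a length; read in chunks until end of file.
        while (!feof($in)) {
            fwrite($out, fread($in, 8192));
        }
        // Newline separator so the last word of one file and the
        // first word of the next are not glued together.
        fwrite($out, "\n");
        fclose($in);
        $merged++;
    }

    fclose($out);
    return $merged;
}
```

Note that array_unique() works on arrays, not on files, so you would still have to read mergedFiles.txt back in and split it into words. That is why collecting the words directly is usually simpler than merging first.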

EDIT: Another version, which:

removes trailing dots
collapses multiple spaces
separates words when the space is missing between the end of one sentence and the start of the next.

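// Strip trailing dots from a word, e.g. "text." becomes "text"
function removeDot($string) {
    return rtrim($string, '.');
}

// These two lines replace the explode() line inside the loop:
// collapse whitespace runs into one space, then insert a space after any dot directly followed by a letter
$words = explode(' ', preg_replace('#\.([a-zA-Z])#', '. $1', preg_replace('/\s+/', ' ', $contents)));
$words = array_map("removeDot", $words);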
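Putting the pieces together, here is a rough sketch of the whole thing as one function. It swaps explode() for preg_split() so that newlines and punctuation are handled in one step, and it uses array keys as a set instead of array_merge() plus array_unique(). The name countUniqueWords, the definition of a "word" (letters, digits and apostrophes), and the case-insensitive comparison are all assumptions on my part, so adjust them to your data.

```php
<?php
// Hypothetical helper: returns the number of distinct words
// across all .txt files in $dir.
function countUniqueWords(string $dir): int
{
    $seen = []; // word => true, used as a set

    $files = glob(rtrim($dir, '/\\') . '/*.txt');
    if ($files === false) {
        $files = [];
    }

    foreach ($files as $filename) {
        $contents = file_get_contents($filename);
        if ($contents === false) {
            continue; // unreadable file, skip it
        }

        // Assumption: a word is a run of letters, digits or apostrophes.
        // Everything else (spaces, newlines, dots, commas) is a separator.
        $words = preg_split("/[^\\p{L}\\p{N}']+/u", $contents, -1, PREG_SPLIT_NO_EMPTY);
        if ($words === false) {
            continue; // e.g. the file is not valid UTF-8
        }

        foreach ($words as $word) {
            // Assumption: "Text" and "text" count as the same word.
            // strtolower() only lowercases ASCII letters here.
            $seen[strtolower($word)] = true;
        }
    }

    return count($seen);
}
```

With the three example files from the question in one directory, countUniqueWords() on that directory returns 6 (here, is, some, text, more, even).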
