Reputation: 495
I have a directory of text files. I want to loop through all of them and get the count of unique words (the vocabulary size) across ALL the files together, NOT the count of unique words for each individual file.
For example, I have three text files in a directory. Here are their contents:
file1.txt -> here is some text.
file2.txt -> here is more text.
file3.txt -> even more text.
So the count of unique words for this directory of text files in this case is 6.
I have tried to use this code:
$files = glob("C:\\wamp\\dir");
$out = fopen("mergedFiles.txt", "w");
foreach($files as $file){
    $in = fopen($file, "r");
    while ($line = fread($in)){
        fwrite($out, $line);
    }
    fclose($in);
}
fclose($out);
My plan was to merge all the text files with this code and then use array_unique() on mergedFiles.txt. However, the code is not working.
How can I get the unique word count of all the text files in the directory in the best way possible?
Upvotes: 0
Views: 1615
Reputation: 47904
Unless you have legitimate reasons not to simply concatenate the files and process their content as a concatenated string, use this snippet to target txt files in a directory, join their texts, make the text lowercase, isolate words, remove duplicates, then count unique words:
Code (not fully tested on a filesystem):
echo count(
    array_unique(
        str_word_count(
            strtolower(
                implode(
                    ' ',
                    array_map(
                        'file_get_contents',
                        glob("*.txt")
                    )
                )
            ),
            1
        )
    )
);
Assuming the files contain these texts:
[
    'here is some text.',
    'here is more text.',
    'even more text.'
]
The output is 6, from a unique array of:
array (
    0 => 'here',
    1 => 'is',
    2 => 'some',
    3 => 'text',
    6 => 'more',
    8 => 'even',
)
Modify the snippet as needed: perhaps use a different technique/algorithm to identify "words", or use mb_strtolower(), or don't use strtolower() at all.
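As one possible modification along those lines, here is a minimal sketch that swaps in mb_strtolower() and a Unicode-aware preg_split() to identify words. The function name countUniqueWordsUnicode() is hypothetical, the definition of a "word" (a run of letters, digits, or apostrophes) is an assumption, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper (not part of the snippet above): counts unique words
// across several texts with multibyte-safe lowercasing.
function countUniqueWordsUnicode(array $texts): int
{
    $joined = mb_strtolower(implode(' ', $texts), 'UTF-8');

    // Assumption: a "word" is a run of letters, digits, or apostrophes.
    // Split on everything else and drop empty pieces.
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", $joined, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (e.g. invalid UTF-8 input).
    return count(array_unique($words ?: []));
}

// With real files, the texts could come from:
// array_map('file_get_contents', glob('*.txt'))
echo countUniqueWordsUnicode([
    'here is some text.',
    'here is more text.',
    'even more text.',
]), PHP_EOL; // prints 6
```

Unlike str_word_count(), this treats digits as part of words and handles non-ASCII letters, so the counts may differ from the original snippet on texts that contain either.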
Upvotes: 0
Reputation: 1718
You can try this:
$allWords = array();
foreach (glob("*.txt") as $filename) // loop on each file
{
    $contents = file_get_contents($filename); // Get file contents
    $words = explode(' ', $contents); // Make an array with words
    if ( $words ) {
        $allWords = array_merge($allWords, $words); // combine global words array and file words array
    }
}
var_dump(count(array_unique($allWords)));
EDIT: Another version, which collapses whitespace, separates words that are joined by a period, and strips trailing periods:
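// Strip a trailing period from a word
function removeDot($string) {
    return rtrim($string, '.');
}
// Meant to replace the explode() line inside the loop:
// collapse whitespace, add a space after a period glued to the next word, then split
$words = explode(' ', preg_replace('#\.([a-zA-Z])#', '. $1', preg_replace('/\s+/', ' ',$contents)));
$words = array_map("removeDot", $words);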
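For reference, here is a sketch of how the pieces above might fit together in one function. The name countUniqueWordsInDir() and its $dir parameter are hypothetical, and using array keys to deduplicate is a swapped-in alternative to array_merge() plus array_unique(), not part of the code above. Like the original loop, it is case-sensitive.

```php
<?php
// Hypothetical wrapper: the function name and $dir parameter are illustrative.
function countUniqueWordsInDir(string $dir): int
{
    $allWords = [];
    $files = glob(rtrim($dir, '/\\') . '/*.txt') ?: []; // glob() may return false

    foreach ($files as $filename) {
        $contents = file_get_contents($filename);
        if ($contents === false) {
            continue; // skip unreadable files
        }
        // Collapse whitespace runs (including newlines) to single spaces.
        $contents = preg_replace('/\s+/', ' ', trim($contents));
        // Add a space after a period that is glued to the next word.
        $contents = preg_replace('#\.([a-zA-Z])#', '. $1', $contents);

        foreach (explode(' ', $contents) as $word) {
            $word = rtrim($word, '.'); // same effect as removeDot()
            if ($word !== '') {
                $allWords[$word] = true; // array keys deduplicate as we go
            }
        }
    }
    return count($allWords);
}

// Example call (the path is an assumption):
// echo countUniqueWordsInDir('C:\\wamp\\dir');
```

Storing words as array keys keeps only one entry per distinct word, so memory grows with the vocabulary rather than with the total number of words read.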
Upvotes: 2