user3814982

Reputation: 25

Count total and unique words from thousands of files

I have a large collection of text files (over 5,000 files containing more than 200,000 words in total). The problem is that when I try to combine the whole collection into a single array in order to find the unique words, no output is shown (the array becomes far too large). The following piece of code works fine for a small number of files, e.g. 30, but cannot handle the full collection. Help me fix this problem. Thanks.

<?php
ini_set('memory_limit', '1024M');
$directory = "archive/";
$dir = opendir($directory);
$file_array = array(); 
while (($file = readdir($dir)) !== false) {
  $filename = $directory . $file;
  $type = filetype($filename);
  if ($type == 'file') {
    $contents = file_get_contents($filename);
    $text = preg_replace('/\s+/', ' ',  $contents);
    $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
    $text = explode(" ", $text);
    $text = array_map('strtolower', $text);
    $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to");
    $text = (array_diff($text,$stopwords));
    $file_array = array_merge($file_array,  $text);
  }
}
closedir($dir); 
$total_word_count = count($file_array);
$unique_array = array_unique($file_array);
$unique_word_count = count($unique_array);
echo "Total Words: " . $total_word_count."<br>";
echo "Unique Words: " . $unique_word_count;
?> 

Dataset of text files can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip

Upvotes: 2

Views: 1857

Answers (4)

nl-x

Reputation: 11832

Instead of juggling multiple arrays, just build one array, populate it with the words, and count them as you insert them. This will be faster, and you will even get the count per word.

By the way, you also need to add the empty string to the list of stopwords, or adjust your logic to avoid including it.

<?php
$directory = "archive/";
$dir = opendir($directory);
$wordcounter = array();
while (($file = readdir($dir)) !== false) {
  if (filetype($directory . $file) == 'file') {
    $contents = file_get_contents($directory . $file);
    $text = preg_replace('/\s+/', ' ',  $contents);
    $text = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $text);
    $text = explode(" ", $text);
    $text = array_map('strtolower', $text);
    // count each word as it is encountered, building one counting array
    foreach ($text as $word) {
        if (!isset($wordcounter[$word])) {
            $wordcounter[$word] = 1;
        } else {
            $wordcounter[$word]++;
        }
    }
  }
}
closedir($dir); 

// remove the stopwords (including the empty string) once, after counting
$stopwords = array("", "a", "an", "and", "are", "as", "at", "be", "by", "for", "is", "to");
foreach ($stopwords as $stopword) {
    unset($wordcounter[$stopword]);
}

$total_word_count = array_sum($wordcounter);
$unique_word_count = count($wordcounter);
echo "Total Words: " . $total_word_count."<br>";
echo "Unique Words: " . $unique_word_count."<br>";

// bonus:
$max = max($wordcounter);
echo "Most used word is used $max times: " . implode(", ", array_keys($wordcounter, $max))."<br>";
?>

Upvotes: 1

Daniel W.

Reputation: 32300

A different approach is to load everything into a database table and then let the database server handle most of the work.

Or process the rows in chunks and mark finished rows, or aggregate them into another table. A rough sketch of the first idea is shown below.
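This is only a sketch under assumed names: the `words` table, the DSN, and the credentials are placeholders, not anything from the question, and the cleanup regex just mirrors the one in the question.

<?php
// Placeholder connection details and table name.
$pdo = new PDO('mysql:host=localhost;dbname=corpus', 'user', 'pass');
$pdo->exec("CREATE TABLE IF NOT EXISTS words (word VARCHAR(100)) ENGINE=InnoDB");

$insert = $pdo->prepare("INSERT INTO words (word) VALUES (?)");
$directory = "archive/";
foreach (scandir($directory) as $file) {
    if (!is_file($directory . $file)) continue;
    $text = file_get_contents($directory . $file);
    $text = strtolower(preg_replace('/[^A-Za-z0-9\- ]/', '', preg_replace('/\s+/', ' ', $text)));
    $pdo->beginTransaction(); // batch the inserts per file to keep it reasonably fast
    foreach (explode(" ", $text) as $word) {
        if ($word !== '') $insert->execute(array($word));
    }
    $pdo->commit();
}

// Let the database server do the counting.
echo "Total Words: "  . $pdo->query("SELECT COUNT(*) FROM words")->fetchColumn() . "<br>";
echo "Unique Words: " . $pdo->query("SELECT COUNT(DISTINCT word) FROM words")->fetchColumn();
?>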

Upvotes: 0

feeela

Reputation: 29932

Do not increase the memory limit too high. This is typically not the best solution.

What you should do is load the file line by line (which is easy in PHP when dealing with formats like CSV), process that single line (or a small batch of lines) and write to an output file. That way you can work on enormous amounts of input data with small memory usage.

In any case, try to find a way to split the complete input into smaller chunks that can be processed without increasing the memory limit; a sketch applied to this task follows.
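A minimal sketch of the line-by-line idea for this word count (the cleanup regex mirrors the one in the question; only one line of text plus the per-word counters are held in memory at a time):

<?php
$directory = "archive/";
$wordcounter = array();
foreach (scandir($directory) as $file) {
    if (!is_file($directory . $file)) continue;
    $handle = fopen($directory . $file, 'r');
    while (($line = fgets($handle)) !== false) { // read one line at a time
        $line = strtolower(preg_replace('/[^a-z0-9\- ]/i', '', $line));
        foreach (preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $wordcounter[$word] = isset($wordcounter[$word]) ? $wordcounter[$word] + 1 : 1;
        }
    }
    fclose($handle);
}
echo "Total Words: "  . array_sum($wordcounter) . "<br>";
echo "Unique Words: " . count($wordcounter);
?>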

Upvotes: 0

Justinas

Reputation: 43479

Why combine all of the arrays into one big, useless array?

You can use the array_unique function to get the unique values from an array, then merge it with the next file's array and apply the same function again, as sketched below.
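A sketch of that incremental approach (the total word count has to be tallied separately, since only the unique values are carried forward; the cleanup regex mirrors the one in the question):

<?php
$directory = "archive/";
$unique = array();
$total_word_count = 0;
foreach (scandir($directory) as $file) {
    if (!is_file($directory . $file)) continue;
    $text = strtolower(preg_replace('/[^a-z0-9\- ]/i', ' ', file_get_contents($directory . $file)));
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $total_word_count += count($words);
    // merge this file's words, then deduplicate before moving on, so the
    // working array never holds more than the unique words seen so far
    $unique = array_unique(array_merge($unique, $words));
}
echo "Total Words: " . $total_word_count . "<br>";
echo "Unique Words: " . count($unique);
?>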

Upvotes: 0
