Kevin
Kevin

Reputation: 5082

PHP - Count Distinct Value in a CSV file

I have a csv file with a very large number of item (5000 lines) in this format

storeId,bookId,nb
124,48361,0
124,48363,6
125,48362,8
125,48363,2
126,28933,4
142,55433,6
142,55434,10
171,55871,7
171,55872,6

I need to count the number of stores in the file, so for exemple with the line above the result should be 5. But I need to doo it with 5000 lines so I can't just loop.

How can I achieve that?

I also need too return the max quantity, so 10

I began by converting the file into an array:

if (file_exists($file)) {
            $csv = array_map('str_getcsv', file($file));

            #Stores
            $storeIds = array_column($csv, 0);
            $eachStoreNb = array_count_values($storeIds);
            $storeCount = count($eachStoreNb);

        }

        print_r($storeCount);

Is there a better way to do it? Faster ? Maybe without using the array

Upvotes: 0

Views: 352

Answers (5)

MyICQ
MyICQ

Reputation: 1158

Solution in AWK, to compare the difference. This includes the count of each store as well. AWK should be able to process millions in less than 1 second. I use the same to filter duplicates from a file.

BEGIN{   # Set some variables initially
  FS=","   # field separator for INPUT
  mymax=0  # init variable mymax
}
NR>1 {            # skip the header line, this matches line 2 onwards
  mycount[$1]++   # increase associative array at that position
  if ($3>mymax){  # compare with max
  mymax=$3
  }

}
END{              # finally print results
  for (i in mycount){
    if (length(i)>0){
       print "value " i  " has " mycount[i]
    }
   }
  print "Maximum value is "  mymax
}

Upvotes: 0

Luuk
Luuk

Reputation: 14939

An answer with awk would be:

awk -F, 'BEGIN {getline}
         { a[$1]++; m=$3>m?$3:m }
         END{ for (i in a){ print i, a[i] };
              print "Number of stores",length(a), "max:",m}' testfile
  1. getline to skip the first line
  2. increment the element with the value of the first column $1 in array a with one, and keep the max value in m
  3. loop over the array a and print all counts (optional)
  4. print the total 'Number of stores', and the max value.

output:

124 52
125 52
126 26
142 52
171 52
Number of stores 5 max: 10

Upvotes: 0

SirPilan
SirPilan

Reputation: 4857

What 'nice_dev' says, but a little more compact.

$fp = fopen('<your_file>', 'r');

fseek($fp, strpos($content, "\n") + 1); // skip first line
$stores = [];
while($row = fgetcsv($fp)) {
    $stores[$row[0]] = max([($stores[$row[0]] ?? 0), $row[2]]);
}

Working example.

Upvotes: 0

nice_dev
nice_dev

Reputation: 17805

Faster here would come in the context of micro-optimization, however you can see an improvement in memory usage.

  • You could just read the file line by line instead of collecting all store IDs in an array and then doing an array_count_values() saving you an extra loop and unnecessary linear storage of all duplicate values.
  • Store IDs would just be made as a key for an associative array.
  • For max NB, you can just keep a max variable keeping the track of max value using max() function. Rest is self-explanatory.

Snippet:

<?php

$file = 'test.csv';
if (file_exists($file)) {
    $fp = fopen($file ,'r');
    $max_nb = 0;
    $store_set = [];
    fgetcsv($fp); // ignoring headers
    while(!feof($fp)){
        $row = fgetcsv($fp);
        $store_set[$row[0]] = true;
        $max_nb = max($max_nb,end($row));
    }

    fclose($fp);

    echo "Num Stores : ",count($store_set),"<br/>";
    echo "Max NB : ",$max_nb;

}else{
    echo "No such CSV file found.";
}

Note: For profiling, I suggest you to try both scripts using xdebug

Upvotes: 3

Minh Viet
Minh Viet

Reputation: 61

What if you looped through the file line by line? I mean ...

$datas = [];
$handle = fopen("filename.csv", "r");

$flagFirstLine = true;
while(!feof($handle)){
    //dont read first line
    if($flagFirstLine) continue;
    $flagFirstLine = false;

    $csvLine = fgetcsv($handle);
    $storeID = $csvLine[0];

    $datas[] = $storeID;
}

echo "all row: " . count($datas);
echo "\nnum store: " . count(array_unique($datas));

Upvotes: 0

Related Questions