Reputation: 5082

PHP - Count Distinct Value in a CSV file

I have a csv file with a very large number of item (5000 lines) in this format

storeId,bookId,nb
124,48361,0
124,48363,6
125,48362,8
125,48363,2
126,28933,4
142,55433,6
142,55434,10
171,55871,7
171,55872,6

I need to count the number of stores in the file, so for exemple with the line above the result should be 5. But I need to doo it with 5000 lines so I can't just loop.

How can I achieve that?

I also need too return the max quantity, so 10

I began by converting the file into an array:

if (file_exists($file)) {
            $csv = array_map('str_getcsv', file($file));

            #Stores
            $storeIds = array_column($csv, 0);
            $eachStoreNb = array_count_values($storeIds);
            $storeCount = count($eachStoreNb);

        }

        print_r($storeCount);

Is there a better way to do it? Faster ? Maybe without using the array

Upvotes: 0

Answers (5)

MyICQ

Reputation: 1158

Solution in AWK, to compare the difference. This includes the count of each store as well. AWK should be able to process millions in less than 1 second. I use the same to filter duplicates from a file.

BEGIN{   # Set some variables initially
  FS=","   # field separator for INPUT
  mymax=0  # init variable mymax
}
NR>1 {            # skip the header line, this matches line 2 onwards
  mycount[$1]++   # increase associative array at that position
  if ($3>mymax){  # compare with max
  mymax=$3
  }

}
END{              # finally print results
  for (i in mycount){
    if (length(i)>0){
       print "value " i  " has " mycount[i]
    }
   }
  print "Maximum value is "  mymax
}

Upvotes: 0

Luuk

Reputation: 14939

An answer with awk would be:

awk -F, 'BEGIN {getline}
         { a[$1]++; m=$3>m?$3:m }
         END{ for (i in a){ print i, a[i] };
              print "Number of stores",length(a), "max:",m}' testfile

getline to skip the first line
increment the element with the value of the first column $1 in array a with one, and keep the max value in m
loop over the array a and print all counts (optional)
print the total 'Number of stores', and the max value.

output:

124 52
125 52
126 26
142 52
171 52
Number of stores 5 max: 10

Upvotes: 0

SirPilan

Reputation: 4857

What 'nice_dev' says, but a little more compact.

$fp = fopen('<your_file>', 'r');

fseek($fp, strpos($content, "\n") + 1); // skip first line
$stores = [];
while($row = fgetcsv($fp)) {
    $stores[$row[0]] = max([($stores[$row[0]] ?? 0), $row[2]]);
}

Working example.

Upvotes: 0

nice_dev

Reputation: 17805

Faster here would come in the context of micro-optimization, however you can see an improvement in memory usage.

You could just read the file line by line instead of collecting all store IDs in an array and then doing an array_count_values() saving you an extra loop and unnecessary linear storage of all duplicate values.
Store IDs would just be made as a key for an associative array.
For max NB, you can just keep a max variable keeping the track of max value using max() function. Rest is self-explanatory.

Snippet:

<?php

$file = 'test.csv';
if (file_exists($file)) {
    $fp = fopen($file ,'r');
    $max_nb = 0;
    $store_set = [];
    fgetcsv($fp); // ignoring headers
    while(!feof($fp)){
        $row = fgetcsv($fp);
        $store_set[$row[0]] = true;
        $max_nb = max($max_nb,end($row));
    }

    fclose($fp);

    echo "Num Stores : ",count($store_set),"<br/>";
    echo "Max NB : ",$max_nb;

}else{
    echo "No such CSV file found.";
}

Note: For profiling, I suggest you to try both scripts using xdebug

Upvotes: 3

Minh Viet

Reputation: 61

What if you looped through the file line by line? I mean ...

$datas = [];
$handle = fopen("filename.csv", "r");

$flagFirstLine = true;
while(!feof($handle)){
    //dont read first line
    if($flagFirstLine) continue;
    $flagFirstLine = false;

    $csvLine = fgetcsv($handle);
    $storeID = $csvLine[0];

    $datas[] = $storeID;
}

echo "all row: " . count($datas);
echo "\nnum store: " . count(array_unique($datas));

Upvotes: 0

PHP - Count Distinct Value in a CSV file

Answers (5)

Related Questions