Reputation: 85
Is there a simple way to find and remove duplicate rows from a CSV file?
Sample test.csv file:
row1 test tyy......
row2 tesg ghh
row2 tesg ghh
row2 tesg ghh
....
row3 tesg ghh
row3 tesg ghh
...
row4 tesg ghh
Expected results:
row1 test tyy......
row2 tesg ghh
....
row3 tesg ghh
...
row4 tesg ghh
Where can I start to accomplish this with PHP?
Upvotes: 2
Views: 12935
Reputation: 364
One line solution:
file_put_contents('newdata.csv', array_unique(file('data.csv')));
Upvotes: 0
Reputation: 38436
A straight-to-the point method would be to read the file in line-by-line and keep track of each row you've previously seen. If the current row has already been seen, skip it.
Something like the following (untested) code may work:
<?php
// array to hold all "seen" lines
$lines = array();
// open the csv file
if (($handle = fopen("test.csv", "r")) !== false) {
// read each line into an array
while (($data = fgetcsv($handle, 8192, ",")) !== false) {
// build a "line" from the parsed data
$line = join(",", $data);
// if the line has been seen, skip it
if (isset($lines[$line])) continue;
// save the line
$lines[$line] = true;
}
fclose($handle);
}
// build the new content-data
$contents = '';
foreach ($lines as $line => $bool) $contents .= $line . "\r\n";
// save it to a new file
file_put_contents("test_unique.csv", $contents);
?>
This code uses fgetcsv()
and uses a space comma as your column-delimiter (based on the sample-data in your question comment).
Storing every line that has been seen, as above, will assure to remove all duplicate lines in the file regardless of whether-or-not they're directly following one another or not. If they're always going to be back-to-back, a more simple method (and more memory conscious) would be to store only the last-seen line and then compare against the current one.
UPDATE (duplicate lines via the SKU-column, not full-line)
Based on sample data provided in a comment, the "duplicate lines" aren't actually equal (though they are similar, they differ by a good number of columns). The similarity between them can be linked to a single column, the sku
.
The following is an expanded version of the above code. This block will parse the first line (column-list) of the CSV file to determine which column contains the sku
code. From there, it will keep a unique list of SKU codes seen and if the current line has a "new" code, it will write that line to the new "unique" file using fputcsv()
:
<?php
// array to hold all unique lines
$lines = array();
// array to hold all unique SKU codes
$skus = array();
// index of the `sku` column
$skuIndex = -1;
// open the "save-file"
if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {
// open the csv file
if (($readHandle = fopen("test.csv", "r")) !== false) {
// read each line into an array
while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {
if ($skuIndex == -1) {
// we need to determine what column the "sku" is; this will identify
// the "unique" rows
foreach ($data as $index => $column) {
if ($column == 'sku') {
$skuIndex = $index;
break;
}
}
if ($skuIndex == -1) {
echo "Couldn't determine the SKU-column.";
die();
}
// write this line to the file
fputcsv($saveHandle, $data);
}
// if the sku has been seen, skip it
if (isset($skus[$data[$skuIndex]])) continue;
$skus[$data[$skuIndex]] = true;
// write this line to the file
fputcsv($saveHandle, $data);
}
fclose($readHandle);
}
fclose($saveHandle);
}
?>
Overall, this method is far-more memory friendly as it doesn't need to save a copy of every line in memory (only the SKU codes).
Upvotes: 13