How to remove duplicate rows from a CSV file?

Question

Is there a simple way to find and remove duplicate rows from a CSV file?

Sample test.csv file:

row1 test tyy......
row2 tesg ghh
row2 tesg ghh
row2 tesg ghh
....
row3 tesg ghh
row3 tesg ghh
...
row4 tesg ghh

Expected results:

row1 test tyy......
row2 tesg ghh
....
row3 tesg ghh
...
row4 tesg ghh

Where can I start to accomplish this with PHP?

newfurniturey · Accepted Answer

A straight-to-the point method would be to read the file in line-by-line and keep track of each row you've previously seen. If the current row has already been seen, skip it.

Something like the following (untested) code may work:

 $bool) $contents .= $line . "
";

// save it to a new file
file_put_contents("test_unique.csv", $contents);
?>

This code uses fgetcsv() and uses a ~~space~~ comma as your column-delimiter (based on the sample-data in your ~~question~~ comment).

Storing every line that has been seen, as above, will assure to remove all duplicate lines in the file regardless of whether-or-not they're directly following one another or not. If they're always going to be back-to-back, a more simple method (and more memory conscious) would be to store only the last-seen line and then compare against the current one.

UPDATE (duplicate lines via the SKU-column, not full-line)
Based on sample data provided in a comment, the "duplicate lines" aren't actually equal (though they are similar, they differ by a good number of columns). The similarity between them can be linked to a single column, the sku.

The following is an expanded version of the above code. This block will parse the first line (column-list) of the CSV file to determine which column contains the sku code. From there, it will keep a unique list of SKU codes seen and if the current line has a "new" code, it will write that line to the new "unique" file using fputcsv():

 $column) {
                    if ($column == 'sku') {
                        $skuIndex = $index;
                        break;
                    }
                }
                if ($skuIndex == -1) {
                    echo "Couldn't determine the SKU-column.";
                    die();
                }
                // write this line to the file
                fputcsv($saveHandle, $data);
            }

            // if the sku has been seen, skip it
            if (isset($skus[$data[$skuIndex]])) continue;
            $skus[$data[$skuIndex]] = true;

            // write this line to the file
            fputcsv($saveHandle, $data);
        }
        fclose($readHandle);
    }
    fclose($saveHandle);
}
?>

Overall, this method is far-more memory friendly as it doesn't need to save a copy of every line in memory (only the SKU codes).

How to remove duplicate rows from a CSV file?

Answers (2)

Related Questions