Herodot Thukydides

Reputation: 147

Check for unique elements of csv

I would like to check for duplicates in a .csv file (structure below). Every value in this .csv has to be unique! For example, "a" appears three times, but it should be there only once.

###start
a
a;b;
d;e
f;g
h
i;
i
d;b
a

c;i

### end

The progress so far:

import os,glob
import csv
folder_path = "csv_entities/"
found_rows = set()

for filepath in glob.glob(os.path.join(folder_path, "*.csv")):
    with open(filepath) as fin, open("newfile.csv", "w") as fout:
        reader = csv.reader(fin, delimiter=";")
        writer = csv.writer(fout, delimiter=";")
        for row in reader:
            # delete empty list elements
            if "" in row:
                row = row[:-1]
            # delete empty row
            if not row:
                continue
            row = tuple(row)  # make row hashable 
            # don't write if row is there already!
            if row in found_rows:
                continue
            print(row)
            writer.writerow(row)
            found_rows.add(row)

Which results in this csv:

###start
a

a;b

d;e

f;g

h

i

d;b

c;i

###end

The most important question right now is: How can I get rid of the duplicate values?

E.g., in the second row there should be only "b" instead of "a;b", because "a" is already in the row before.

Upvotes: 0

Views: 838

Answers (1)

Jean-François Fabre

Reputation: 140168

Your mistake is treating the rows themselves as the unique elements. You have to treat the individual cells as the elements.

So use your marker set to mark elements, not rows.

Example with only one input file (using several input files with only one output file makes no sense):

import csv

found_values = set()

with open("input.csv", newline="") as fin, open("newfile.csv", "w", newline="") as fout:
    reader = csv.reader(fin, delimiter=";")
    writer = csv.writer(fout, delimiter=";")
    for row in reader:
        # delete empty list elements & filter out already seen elements
        new_row = [x for x in row if x and x not in found_values]
        # update marker set with row contents
        found_values.update(row)
        if new_row:
            # new row isn't empty: write it
            writer.writerow(new_row)

The resulting CSV file is:

a
b
d;e
f;g
h
i
c
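
One case the sample data does not exercise is a value repeated inside a single row (say "a;b;b"): since the marker set is only updated after the row has been filtered, both copies would be written. If that can happen in your data, a possible variant is to check and mark cell by cell. This is only a sketch: `dedupe_cells` is a hypothetical helper name, and the sample rows are made up and read from memory instead of a file.

```python
import csv
import io


def dedupe_cells(rows):
    """Yield each row with empty cells and already seen cells removed."""
    seen = set()
    for row in rows:
        new_row = []
        for cell in row:
            # skip empty cells and anything seen before, including
            # an earlier cell of this same row
            if cell and cell not in seen:
                seen.add(cell)
                new_row.append(cell)
        if new_row:
            # only keep rows that still contain something
            yield new_row


# in-memory stand-in for the input file (made-up data);
# the second row repeats "b" within the row itself
sample = "a\na;b;b;\nd;e\nd;b\n\nc;a\n"
reader = csv.reader(io.StringIO(sample), delimiter=";")
result = list(dedupe_cells(reader))
print(result)
```

For real files, the generator can be fed the `csv.reader` from the code above, with each yielded row passed to `writer.writerow`.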

Upvotes: 2
