Muscles

Reputation: 41

Ignoring Duplicate Rows in a CSV

I am trying to read a CSV file and write its rows to another CSV file. My input file has duplicate rows, and in the output I want only a single row. As you can see from my sample script, I created a list called readers that holds all the rows of the input CSV. Then, inside the for loop, I use writer.writerow(readers[1] + ....), which reads the first row after the header. The problem is that this first row gets written repeatedly. How can I tweak my script so it is written only once?

import csv
import glob

# 'writer' is a csv.writer for the output file, created elsewhere.
for path in glob.glob("out.csv"):
    if path == "out1.csv":
        continue
    with open(path) as fh:
        readers = list(csv.reader(fh))

        for row in readers:
            if row[8] == 'READ' and row[10] == '1110':
                # readers[1] (the first data row) is written on every match
                writer.writerow(readers[1] + [] + [row[2]])
            elif row[8] == 'READ' and row[10] == '1011':
                writer.writerow(readers[1] + [] + [" "] + [" "] + [" "] + [row[2]])
            elif row[8] == 'READ' and row[10] not in ('1101', '0111'):
                writer.writerow(readers[1] + [] + [" "] + [row[2]])

Sample Input

    ID No.  Name    Value   RESULTS
      28    Jason   56789   Fail
      28    Jason   56789   Fail
      28    Jason   56789   Fail
      28    Jason   56789   Fail

Upvotes: 0

Views: 1744

Answers (3)

Arminius

Reputation: 1169

While the other answers are basically correct, using Pandas for this seems like overkill to me. Simply keep a list of the ID column values you have already seen during processing (assuming the ID column earns its name; otherwise you would have to use a combined key). Then just check whether you have already seen the value, and presto:

import csv
import glob

ID_COL = 1  # index of the ID column; adjust to your file layout
id_seen = []
for path in glob.glob("out.csv"):
    if path == "out1.csv":
        continue
    with open(path) as fh:
        for row in csv.reader(fh):
            if row[ID_COL] not in id_seen:
                id_seen.append(row[ID_COL])
                # write out whatever columns you have to
                writer.writerow(row)
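
If the input is large, a set makes the membership test O(1) instead of scanning the list each time. A minimal sketch of the same idea, reusing ID_COL and the assumed writer object from the snippet above:

id_seen = set()
for path in glob.glob("out.csv"):
    if path == "out1.csv":
        continue
    with open(path) as fh:
        for row in csv.reader(fh):
            if row[ID_COL] not in id_seen:
                id_seen.add(row[ID_COL])  # set.add instead of list.append
                writer.writerow(row)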

Upvotes: 0

RZRKAL

Reputation: 417

You may use the pandas package. That would be something like this:

import pandas as pd
# Read the file (the first row is used as the header by default):
table = pd.read_csv("out.csv")
# Drop the duplicate rows:
clean_table = table.drop_duplicates()
# Save the clean data; index=False keeps pandas from writing the row index:
clean_table.to_csv("data_without_duplicates.csv", index=False)

You may check the pandas references for read_csv and drop_duplicates.
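
If duplicates should be detected on the ID column alone rather than on the whole row, drop_duplicates takes a subset argument. A short sketch, with the column name taken from the sample input in the question:

import pandas as pd

table = pd.read_csv("out.csv")
# Keep only the first occurrence of each ID:
clean_table = table.drop_duplicates(subset=["ID No."])
clean_table.to_csv("data_without_duplicates.csv", index=False)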

Upvotes: 1

Pedro Ferreira

Reputation: 115

You can use the set type to remove duplicates. Note that the rows produced by csv.reader are lists, which are not hashable, so they have to be converted to tuples first:

readers_unique = [list(row) for row in set(map(tuple, readers))]
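
Keep in mind that a set does not preserve the original row order. If order matters, dict.fromkeys (insertion-ordered since Python 3.7) gives the same deduplication while keeping the first occurrence of each row in place:

readers_unique = [list(row) for row in dict.fromkeys(map(tuple, readers))]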

Upvotes: 0
