Reputation: 41
I am trying to read a CSV file and write its rows to another CSV file. My input file has duplicate rows; in the output I want only a single row. In my sample script I create a list called readers that holds all the rows of the input CSV. Then, inside the for loop, I call writer.writerow(readers[1] + ...), which writes the first row after the header. The problem is that this first row gets written repeatedly, once per loop iteration. How can I tweak my script so it is written only once?
for path in glob.glob("out.csv"):
    if path == "out1.csv":
        continue
    with open(path) as fh:
        readers = list(csv.reader(fh))
        for row in readers:
            if row[8] == 'READ' and row[10] == '1110':
                writer.writerow(readers[1] + [] + [row[2]])
            elif row[8] == 'READ' and row[10] == '1011':
                writer.writerow(readers[1] + [] + [" "] + [" "] + [" "] + [row[2]])
            elif row[8] == 'READ' and row[10] not in ('1101', '0111'):
                writer.writerow(readers[1] + [] + [" "] + [row[2]])
Sample Input
ID No.  Name   Value  RESULTS
28      Jason  56789  Fail
28      Jason  56789  Fail
28      Jason  56789  Fail
28      Jason  56789  Fail
Upvotes: 0
Views: 1744
Reputation: 1169
While the answers above are basically correct, using Pandas for this seems like overkill to me. Simply keep a collection of the ID column values you have already seen during processing (assuming the ID column earns its name; otherwise you have to use a combined key). Then just check whether you have already seen the value and, presto:
ID_COL = 1
id_seen = []
for path in glob.glob("out.csv"):
    if path == "out1.csv":
        continue
    with open(path) as fh:
        for row in csv.reader(fh):
            if row[ID_COL] not in id_seen:
                id_seen.append(row[ID_COL])
                # write out whatever columns you have to
                writer.writerow(row)
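A runnable sketch of the same seen-ID idea, using an in-memory buffer instead of real files so it is self-contained (the data mirrors the question's sample; the column layout and file handling here are illustrative assumptions):

```python
import csv
import io

# Sample input with duplicate rows, mirroring the question's data.
raw = """ID,Name,Value,RESULTS
28,Jason,56789,Fail
28,Jason,56789,Fail
28,Jason,56789,Fail
"""

ID_COL = 0        # index of the ID column in this demo layout
id_seen = set()   # a set gives O(1) membership tests

out = io.StringIO()
writer = csv.writer(out)

reader = csv.reader(io.StringIO(raw))
writer.writerow(next(reader))  # copy the header row exactly once
for row in reader:
    if row[ID_COL] not in id_seen:
        id_seen.add(row[ID_COL])
        writer.writerow(row)

print(out.getvalue())
```

Using a set rather than a list for id_seen keeps the membership test fast even for large inputs.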
Upvotes: 0
Reputation: 417
You may use the pandas package. That would be something like this:
import pandas as pd
import pandas as pd

# Read the file (the header row is used by default) into a DataFrame:
table = pd.read_csv("out.csv")

# Drop the duplicate rows:
clean_table = table.drop_duplicates()

# Save the clean data (index=False avoids writing the row index):
clean_table.to_csv("data_without_duplicates.csv", index=False)
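To see drop_duplicates in action without touching the filesystem, here is a small demo built from the question's sample data (the DataFrame contents are illustrative):

```python
import pandas as pd

# Build a frame mirroring the question's sample input.
table = pd.DataFrame({
    "ID": [28, 28, 28, 28],
    "Name": ["Jason"] * 4,
    "Value": [56789] * 4,
    "RESULTS": ["Fail"] * 4,
})

# drop_duplicates keeps the first occurrence of each duplicated row.
clean_table = table.drop_duplicates()
print(clean_table)
```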
You may check the pandas documentation for read_csv and drop_duplicates.
Upvotes: 1
Reputation: 115
You can use the set type to remove duplicates. Note that csv.reader yields lists, which are unhashable, so convert each row to a tuple first (and be aware that a set does not preserve row order):
readers_unique = list(set(tuple(row) for row in readers))
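A minimal self-contained example of this approach (the input data is illustrative, matching the question's sample):

```python
import csv
import io

raw = "28,Jason,56789,Fail\n28,Jason,56789,Fail\n"
readers = list(csv.reader(io.StringIO(raw)))

# csv.reader yields lists, which are unhashable, so each row is
# converted to a tuple before being placed in the set.
readers_unique = list(set(tuple(row) for row in readers))
print(readers_unique)
```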
Upvotes: 0